Abstract

The  parallel  programming  system  Linda  consists  of  a  number  of  processes  and  a  shared 
memory  called  the  tuple  space.  In  a  distributed  implementation  of  Linda,  processes  and  the 
tuple  space  reside  on  different  computing  nodes  connected  by  a  communications  network 
subject  to  a  variety  of  node  and  network  failures.  This  thesis  develops  a  scheme  to  make 
the  tuple  space  highly-available  in  the  presence  of  failures. 

High-availability  is  achieved  by  replication:  the  tuple  space  is  replicated  on  several  nodes 
so  that  failures  usually  do  not  disrupt  program  execution.  Our  replication  method  has  two 
parts:  the  operations  protocol  and  the  view  change  algorithm.  The  operations  protocol 
is  a  read-one-write-all  scheme,  that  is,  values  are  read  from  one  of  the  replicas  and  write 
operations  are  executed  at  all  replicas.  The  protocol  exploits  the  semantics  of  the  tuple  space 
operations  to  eliminate  unnecessary  delay  in  program  execution.  When  failures  occur,  the 
replicas  are  reorganized  and  their  states  are  updated.  This  process  is  called  a  view  change 
and  is  accomplished  by  the  view  change  algorithm.  A  view  change  guarantees  that  a  newly 
formed  view  consists  of  a  majority  of  the  replicas,  and  that  all  updates  survive  into  the 
new  view.  Together,  the  operations  protocol  and  the  view  change  algorithm  ensure  that 
operations  are  executed  in  the  correct  order,  updates  to  the  tuple  space  survive  failures,  and 
processes  only  see  the  correct  tuple  space  state  in  spite  of  failures.  In  addition,  operations 
are  performed  by  a  concurrent  background  process  whenever  possible. 
Chapter 1

Introduction 


In  the  parallel  programming  system  Linda  [4]  [13],  processes  (called  workers  in  this  thesis) 
are  uncoupled  in  time  and  space:  they  store  and  pick  up  logical  tuples,  units  of  data  in  Linda, 
in a shared-memory-like data structure referred to as the tuple space. A typical Linda system
consists  of  several  workers  and  a  tuple  space.  The  tuple  space  is  directly  accessible  to  all 
the  workers  simultaneously.  The  workers  read  their  data  from,  and  deposit  the  results  into, 
the  tuple  space.  Computations  can  start  as  soon  as  all  the  data  needed  are  available.  Linda 
has  been  implemented  on  Encore  and  Sequent  shared-memory  multiprocessors,  the  S/Net 
bus-based message-passing network, the Intel iPSC hypercube link-based network, and the
Ethernet-based  multi-computer  local  area  network  [4][7][3]. 

This  thesis  develops  a  mechanism  to  make  the  implementation  of  Linda  on  a  distributed 
system  possible.  A  distributed  system  is  a  collection  of  geographically  distributed  computing 
nodes  connected  to  a  communications  network.  A  communications  network  might  be  a  local 
area  net,  or  it  might  consist  of  a  number  of  local  area  nets  connected  by  a  long  haul  net. 

Some of the potential benefits of a distributed parallel processing system are the following:

• The existing uniprocessor computers, which are probably heterogeneous, can be used to process
large jobs in parallel instead of acquiring expensive new multi-processor machines.

•  The  placement  of  computers  is  not  geographically  restricted.  Numerous  computers 
from  different  geographic  locations  can  work  together  on  a  single  job.  For  instance, 
computers  scattered  in  various  buildings  and  floors  can  cooperate  on  tasks  requiring 
larger  computing  power  than  any  single  one  of  them  can  handle. 
• With a proper fault-tolerant mechanism, failures of individual computing nodes, possibly
caused by loss of power or hardware malfunction, will not disrupt program
execution.

Distributed  parallel  processing  systems,  while  providing  the  above  benefits,  also  give  rise 
to some potential problems. Communication overhead is higher than in multi-processor
systems, where inter-processor communication is commonly done via high-speed data
buses. In addition, networks are susceptible to failures: messages may be lost or duplicated,
the  network  may  fail  (and  thus  disrupt  normal  communication  or  cause  systems  to  be 
partitioned  into  subgroups  that  cannot  communicate),  or  computing  nodes  may  crash.  It  is 
important  to  have  programs  continue  to  run  correctly  in  the  presence  of  network  and  node 
failures. 

This  thesis  addresses  the  problems  that  arise  from  the  system  failures  in  a  distributed 
implementation  of  the  Linda  tuple  space  and  presents  an  efficient  protocol  that  makes 
the tuple space fault-tolerant, and thus highly available. High availability is achieved by
redundancy: a tuple space is replicated onto several, usually geographically distinct, nodes
so that some of the replicas are able to provide information when the others become
inaccessible due to failures.

Replication provides high availability of data, but may cause data inconsistency among
replicas. Undelivered messages or network partitions may cause some replicas not to receive
needed information; duplicate messages may cause some replicas to receive extra
information; and a replica that recovers from a failure may hold out-of-date information. The
protocol presented in this thesis solves these problems.

The protocol consists of two parts: the operations protocol and the view change algorithm.
The operations protocol guarantees the correct execution of the operations on a
replicated  tuple  space.  The  view  change  management  algorithm  guarantees  that  the  tuple 
space  replicas  contain  an  up-to-date  and  consistent  state,  and  that  effects  of  all  completed 
operations  survive  subsequent  failures. 

Our  protocol  provides  some  attractive  properties.  First,  the  replication  is  completely 
hidden from the user program, that is, the replicated tuple space appears to the workers as
a  single  entity.  Second,  the  tuple  space  can  tolerate  simultaneous  failures,  and  progress  can 
be  made  as  long  as  a  majority  of  the  replicas  can  still  talk  to  one  another.  Third,  very  little 
delay  is  imposed  on  the  user  programs.  These  properties  make  a  distributed  implementation 
of  Linda  a  viable  alternative  to  an  implementation  on  a  multi-processor  machine. 

Two of the current Linda implementations were designed for networks, but neither of
them provides a highly-available tuple space in a general communications network.
Compared with these implementations, our protocol tolerates failures that are common in general
networks and provides good performance.

The  thesis  makes  three  contributions: 

1.  It  provides  a  fault-tolerant,  efficient,  distributed  implementation  for  Linda. 

2.  It  indicates  how  fault  tolerance  might  be  achieved  for  other  parallel  systems.  Many 
parallel computations are long lived; fault tolerance is especially interesting for them.
In  addition,  the  other  advantages  of  distribution  apply  to  any  parallel  system. 

3.  It  extends  the  work  on  replication  techniques  by  showing  what  can  be  done  when 
the  semantics  of  the  operations  (that  is,  the  tuple  space  operations)  are  taken  into 
account. 

The  remainder  of  the  thesis  is  organized  as  follows.  Chapter  2  introduces  Linda.  Chapter 
3  gives  an  overview  of  our  scheme,  and  outlines  the  two  parts  of  the  scheme:  the  operations 
protocol  and  the  view  change  management.  The  detailed  descriptions  of  these  two  parts  are 
given  in  chapters  4  and  5,  respectively.  Chapter  6  discusses  related  research  and  extensions 
of  our  work. 


Chapter  2 

Linda 


A  Linda  system  consists  of  several  processes,  which  we  will  refer  to  as  workers,  and  a  memory 
that  is  logically  shared  by  the  workers.  The  workers  cooperate  on  jobs,  communicating 
through  the  logically  shared  memory.  A  worker  with  data  stores  the  data  into  the  memory 
and  one  that  needs  to  receive  data  retrieves  them  from  the  memory.  There  is  no  centralized 
synchronization  among  the  workers  other  than  the  operations  on  the  memory.  Operations 
are  executed  as  soon  as  the  data  needed  are  available. 

This  chapter  describes  the  Linda  data  structure  and  its  operations,  uses  a  simple  example 
to  explain  how  a  Linda  program  runs,  and  introduces  the  notion  of  a  Linda  kernel.  More 
detailed  descriptions  of  Linda  can  be  found  in  [13]  and  [9]. 

2.1  Logical  Tuples  and  Operations 

The basic data unit in Linda is a logical tuple, or tuple for short. A tuple contains a logical
name followed by one or more ordered data elements, which can be either data values such
as “1”, “true”, and “John”, or formals, which are typed variables that can be assigned some
data  value.  For  instance,  (“X”,  1,  true),  (“done”)  and  (“A”,  “John”,  formal  score)  are 
valid  tuples.  “X”  is  the  logical  name,  and  “1”,  “true”  are  the  data  values  of  the  first  tuple, 
“done”  is  the  logical  name  of  the  second  tuple.  In  the  third  tuple,  “A”  is  the  logical  name, 
“John”  is  a  data  value  while  “formal  score”  indicates  that  “score”,  a  previously  declared 
variable,  is  a  formal. 

The  term  template  is  used  to  refer  to  tuples  that  are  the  arguments  of  two  of  the  Linda 
operations  (see  below).  A  template  and  a  tuple  may  match  using  the  following  rules: 
1. both must have the same logical names and the same number of fields,

2.  corresponding  fields  must  be  type  consonant, 

3.  corresponding  data  items  must  be  equal,  and 

4.  there  must  be  no  corresponding  formals. 

For  example: 

•  (“X”,  1,  2,  3,  4,  5)  matches  (“X”,  1,  2,  3,  4,  5), 

• (“X”, formal i, 3, 4, 5) matches (“X”, 2, 3, formal j, 5),

•  (“X”,  1,  true)  matches  (“X”,  formal  i,  formal  b), 

•  (“X”,  formal  i,  true)  matches  (“X”,  1,  true),  and 

•  (“X”,  formal  i,  true)  matches  (“X”,  1,  formal  b) 

where  i  and  j  are  previously  declared  as  integer  variables  and  b  is  a  variable  of  boolean  type. 
On  the  other  hand, 

•  (“X”,  1)  does  not  match  (“X”,  1,  true)  because  of  rule  (1), 

•  (“X”,  “abc”)  does  not  match  (“X”,  1)  because  of  rule  (2), 

•  (“X”,  1)  does  not  match  (“X”,  2)  because  of  rule  (3),  and 

•  (“X”,  formal  i)  does  not  match  (“X”,  formal  j)  because  of  rule  (4). 
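The four matching rules can be sketched in Python as follows. This is an illustrative sketch only, not part of any Linda implementation: the class name Formal, the function name matches, and the encoding of a tuple as a Python tuple whose first element is the logical name are all assumptions made for the example.

```python
class Formal:
    """A typed placeholder standing in for "formal x" (hypothetical name)."""
    def __init__(self, type_):
        self.type = type_


def matches(template, tup):
    """Return True if template and tup match under rules 1 to 4."""
    # Rule 1: same logical name and same number of fields.
    if template[0] != tup[0] or len(template) != len(tup):
        return False
    for a, b in zip(template[1:], tup[1:]):
        a_formal = isinstance(a, Formal)
        b_formal = isinstance(b, Formal)
        # Rule 4: two formals may not correspond to each other.
        if a_formal and b_formal:
            return False
        # Rule 2: corresponding fields must have the same type.
        type_a = a.type if a_formal else type(a)
        type_b = b.type if b_formal else type(b)
        if type_a is not type_b:
            return False
        # Rule 3: corresponding data values must be equal.
        if not a_formal and not b_formal and a != b:
            return False
    return True
```

For instance, matches(("X", 1, True), ("X", Formal(int), Formal(bool))) holds, while matches(("X", Formal(int)), ("X", Formal(int))) fails because of rule (4).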

Tuples  are  stored  in  a  logically  shared  memory  called  a  tuple  space.  Workers  interact 
with  tuples  in  a  tuple  space  via  three  basic  operations:  out,  in,  and  rd.  An  out  operation 
takes  a  tuple  as  its  argument,  and  an  in  or  a  rd  operation  takes  a  template  as  its  argument. 
Let  t  be  a  tuple,  and  s  be  a  template.  Out(t)  causes  tuple  t  to  be  added  to  the  tuple  space; 
the executing worker continues immediately. In(s) causes some tuple t that matches s to be
withdrawn from the tuple space; the values of the actuals in t are assigned to the formals
in s, and the executing worker continues. If no matching t is available when in(s) executes,
the executing worker suspends until one is, and then proceeds as before. If many matching
t’s are available, one is chosen arbitrarily. Rd(s) is the same as in(s), with actuals assigned
to formals as before, except that the matching tuple remains in the tuple space.

In addition to the above three operations, [9] also lists three other operations: eval(p),
inp(s), and rdp(s). Eval(p) starts a process to execute the procedure p. It has little to
do with the tuple space, and hence will be ignored in this thesis. Inp(s) and rdp(s) are
similar to in(s) and rd(s), respectively, except that inp(s) and rdp(s) are non-blocking: the
executing workers do not block if there is no matching tuple in the tuple space. If there is
a tuple matching s, then inp(s) and rdp(s) behave exactly the same as in(s) and
rd(s), respectively. Otherwise, “no_match_found” is signalled. Inp(s) and rdp(s) will not
be included in our protocol. We will discuss these two operations in chapter 6.
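The blocking behaviour of out, in, and rd on a single, unreplicated tuple space can be sketched with a condition variable. The sketch below is an assumption-laden illustration rather than a description of any existing Linda kernel: the names TupleSpace, Formal, and in_ (in is a reserved word in Python) are invented here, stored tuples are assumed to contain only data values, and the matched tuple is returned to the caller instead of being assigned to the formals.

```python
import threading


class Formal:
    """A typed placeholder in a template (hypothetical name)."""
    def __init__(self, type_):
        self.type = type_


def _matches(template, tup):
    # Simplified matching: only the template may contain formals.
    if template[0] != tup[0] or len(template) != len(tup):
        return False
    for s, t in zip(template[1:], tup[1:]):
        if isinstance(s, Formal):
            if type(t) is not s.type:
                return False
        elif type(s) is not type(t) or s != t:
            return False
    return True


class TupleSpace:
    """Single-copy tuple space shared by threads acting as workers."""
    def __init__(self):
        self._tuples = []
        self._changed = threading.Condition()

    def out(self, t):
        # Add t and wake any worker suspended in in_ or rd.
        with self._changed:
            self._tuples.append(t)
            self._changed.notify_all()

    def _take(self, s, remove):
        with self._changed:
            while True:
                for t in self._tuples:
                    if _matches(s, t):
                        if remove:
                            self._tuples.remove(t)
                        return t
                # No match yet: suspend until some out() adds a tuple.
                self._changed.wait()

    def in_(self, s):
        """Withdraw and return some tuple matching template s."""
        return self._take(s, remove=True)

    def rd(self, s):
        """Return some tuple matching template s, leaving it in place."""
        return self._take(s, remove=False)
```

A worker calling ts.in_(("done",)) on an empty space suspends until another worker executes ts.out(("done",)).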

2.2  Programming  in  Linda 

The  Linda  operators  can  be  incorporated  into  a  high-level  language,  transforming  the 
language into a parallel programming language. A simple program that computes the
inner-product of two matrices A and B is shown in Figure 2.1. It illustrates the use of these
operators.  The  initialization  creates  several  workers,  stores  A’s  rows  and  B’s  columns  in  the 
tuple  space,  and  adds  the  tuple  (“Next”,  1),  where  1  is  the  next  element  to  be  computed,  to 
the  tuple  space.  A  worker  first  gets  the  next  task  by  doing  in(“Next”,  formal  NextElem). 
Then  it  reads  A’s  row  and  B’s  column  from  the  tuple  space.  The  result  is  put  back  to  the 
tuple  space  by  out(“result”,  DotProduct(row,col)).  These  results  can  then  be  used  by  some 
other  computation. 
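The worker loop described above can be tried out sequentially in Python. This is only a sketch under several assumptions: TinySpace is a hypothetical single-threaded stand-in for the tuple space that matches on leading fields only, a single worker is run to completion rather than several in parallel, and the row and column indices are derived with divmod, which may differ in detail from the index arithmetic of the original figure.

```python
def dot(row, col):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(row, col))


class TinySpace:
    """Single-threaded stand-in for a tuple space (hypothetical helper)."""
    def __init__(self):
        self.tuples = []

    def out(self, *t):
        self.tuples.append(t)

    def rd(self, *prefix):
        # Return the first tuple whose leading fields equal prefix.
        return next(t for t in self.tuples if t[:len(prefix)] == prefix)

    def in_(self, *prefix):
        t = self.rd(*prefix)
        self.tuples.remove(t)
        return t


def worker(ts, n):
    """Compute result elements until the ("Next", -1) sentinel is seen."""
    while True:
        _, next_elem = ts.in_("Next")
        if next_elem == -1:
            ts.out("Next", -1)          # leave the sentinel for other workers
            return
        ts.out("Next", next_elem + 1 if next_elem < n * n else -1)
        i, j = divmod(next_elem - 1, n)  # zero-based row and column
        row = ts.rd("A", i + 1)[2]
        col = ts.rd("B", j + 1)[2]
        ts.out("result", i + 1, j + 1, dot(row, col))
```

With the rows of A = [[1, 2], [3, 4]] and the columns of B = [[5, 6], [7, 8]] stored as in the initialization, followed by ts.out("Next", 1), a call to worker(ts, 2) leaves the four result tuples in the space.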

2.3  Linda  Kernel 

A  Linda  kernel  serves  as  a  translator  between  Linda  operations  and  the  accesses  to  physical 
memories.  It  supplies  a  form  of  logically-shared  memory  without  assuming  any  physically- 
shared  memory  in  the  underlying  hardware. 

A Linda kernel implemented on a network is called a network kernel. The only existing
kernel implementations that approximate a network kernel are the S/Net kernel and the
VAX-LAN kernel described in [4][7]. But neither kernel provides a highly-available tuple
space in the face of failures. (We will discuss these two kernels in chapter 6.) The inability of
the existing mechanisms to cope with network failures motivates the design of a new scheme
for a network kernel that makes the tuple space continue to be available and uncorrupted
in the face of failures such as node crashes and network partitions. Our scheme provides
a highly-available tuple space without sacrificing performance.


Initialization

eval(worker())                   % create one worker
eval(worker())                   % create another worker
...                              % create some more workers
out(“A”, 1, A’s-1st-row)         % put A’s 1st row into the tuple space
out(“A”, 2, A’s-2nd-row)         % put A’s 2nd row into the tuple space
...                              % more of A’s rows into the tuple space
out(“A”, n, A’s-nth-row)         % put A’s nth row into the tuple space
out(“B”, 1, B’s-1st-col)         % put B’s 1st column into the tuple space
out(“B”, 2, B’s-2nd-col)         % put B’s 2nd column into the tuple space
...                              % more of B’s columns into the tuple space
out(“B”, n, B’s-nth-col)         % put B’s nth column into the tuple space
out(“Next”, 1)                   % next computation

Worker

in(“Next”, formal NextElem)                      % get next computation
if NextElem = -1 then out(“Next”, -1) done exit end
if NextElem < n * n then out(“Next”, NextElem + 1) else out(“Next”, -1) end
i = quotient_of((NextElem - 1)/dim + 1)          % calculate the row of the result
j = remainder_of((NextElem - 1)/dim + 1)         % calculate the column of the result
rd(“A”, i, formal row)                           % get A’s row from the tuple space
rd(“B”, j, formal col)                           % get B’s column from the tuple space
out(“result”, i, j, DotProduct(row, col))        % put the result into the tuple space

Figure 2.1: A program segment that computes a matrix inner-product using Linda operators


Chapter  3 

Overview 


Replication is the standard technique to increase data availability. By replication, we mean
maintaining several physical copies of each logical tuple, usually distributed over a set of
nodes at distinct locations. When one or more copies of a logical tuple become unavailable
due to node or network failures, the rest of the copies can still provide information. For
simplicity, we assume that the tuple space is uniformly replicated, that is, each replica
contains an entire copy of the tuple space. In chapter 6, we will see that this constraint can be
relaxed so that each tuple can be stored on a subset of the replicas.

Replication solves the availability problem, but gives rise to other problems that do not
exist in a single-copy tuple space scheme. These problems include inconsistencies caused
by delayed or lost messages, and out-of-date replicas on nodes that recover from failures. Our
scheme  is  designed  to  solve  these  problems. 

The scheme consists of two parts: the operations protocol and the view change algorithm.
The operations protocol translates each logical operation into physical operations.
For example, an in(s) operation issued by a worker is translated into several physical in(s)
operations performed on all the replicas. The view change algorithm is adopted from the
virtual partitions protocol described in [1] and [2]. It guarantees the integrity of the
accessible part of a tuple space during topological changes of the system. The term view will
be explained as the chapter progresses.

This  chapter  gives  an  overview  of  our  scheme.  It  starts  by  discussing  the  system  model, 
the  failure  assumptions,  and  the  definitions  of  partitions  and  views.  Then  it  lists  the  goals  we 
would  like  to  achieve.  Next  we  give  an  overview  of  our  implementation  of  Linda  operations, 
and explain why the scheme works. An analysis of a set of constraints on the operations on
a replicated tuple space follows. An overview of the view change algorithm is then given.
Both the operations protocol and the view change algorithm will be discussed in detail in
the next two chapters. Finally, we discuss the correctness conditions for our scheme.

Figure 3.1: Workers and Replicated Tuple Space

3.1  Preliminaries 

3.1.1  System  Model 

Our  system  consists  of  a  set  of  tuple  space  replicas  and  a  set  of  workers  as  illustrated  in 
Figure 3.1. Squares r1, r2, r3, ... are tuple space replicas; circles w1, w2, w3, ... represent
workers.  Each  tuple  space  replica  or  worker  resides  on  some  physical  node.  A  physical  node 
can  contain  any  number  of  replicas  or  any  number  of  workers  or  both.  All  physical  nodes 
are  connected  by  a  communications  network  subject  to  a  variety  of  failures  as  discussed 
below. 

Each replica is identified by its unique replica id. The replica ids are totally ordered.
That is, if $r_1\_id$ and $r_2\_id$ are the replica ids of two replicas $r_1$ and $r_2$, then there is a
relation $\prec$ such that either $r_1\_id \prec r_2\_id$ or $r_2\_id \prec r_1\_id$, but not both. The relation
$\prec$ is transitive: if $r_1\_id \prec r_2\_id$ and $r_2\_id \prec r_3\_id$, then $r_1\_id \prec r_3\_id$.

3.1.2  Failure  Assumptions 

Failures  can  occur  in  many  ways:  node  crashes,  lost  and  duplicated  messages,  and  even 
Byzantine  failures  [18],  where  system  components  may  act  in  arbitrary,  even  malicious,  ways. 
We  will  consider  failures  that  have  a  reasonable  chance  of  occurring  in  practical  systems  and 
that  can  be  handled  by  algorithms  of  moderate  complexity  and  cost.  The  failures  satisfying 
these  criteria  include  node  and  network  crashes,  lost  or  duplicate  messages,  message  delays, 
and  network  partitions  [1][10].  Byzantine  failures  are  excluded.  We  assume  that  the  nodes 
are  failstop  [24],  that  is,  they  fail  by  halting.  Node  and  network  crashes,  lost  messages,  and 
delayed  messages,  cause  messages  not  to  be  received  by  the  receiver  within  a  reasonable 
time  interval.  Duplicate  messages  cause  certain  messages  to  be  received  more  than  once. 
Network  partitions  divide  a  system  into  several  subgroups  where  communication  is  possible 
within  each  subgroup,  but  impossible  between  any  pair  of  the  subgroups. 

In  general,  it  is  impossible  for  a  node  to  tell  whether  a  failure  to  receive  a  message  is 
due  to  a  node  crash  or  a  network  partition.  This  is  because  the  effect  of  the  failures,  as  a 
node  perceives  it,  is  the  same  —  no  message  is  received.  Whether  any  message  was  ever 
sent,  or  was  sent  but  not  delivered,  cannot  be  determined  by  an  individual  node.  Thus,  our 
scheme  will  not  rely  on  distinguishing  crashes  from  partitions. 

3.1.3  Partition  vs.  View 

A  partition  of  a  tuple  space  is  a  subset  of  replicas  that  can  communicate  with  each  other. 
We  assume  that  the  can-communicate  relation  between  any  two  nodes  is  transitive  and 
commutative.  That  is,  if  replica  a  can  communicate  with  replica  b,  and  replica  b  can 
communicate  with  replica  c,  then  b  can  communicate  with  a,  c  can  communicate  with  b,  a 
can  communicate  with  c,  and  c  can  communicate  with  a.  Thus,  every  replica  in  a  partition 
can  communicate  with  every  other  replica  in  the  same  partition. 
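Because the can-communicate relation is assumed to be transitive and commutative (symmetric), partitions are simply its equivalence classes. The following Python sketch, offered only as an illustration, derives the partitions from a list of pairs of replicas that can talk to each other directly; the function name partitions and the use of a union-find structure are assumptions of the example, not part of the scheme.

```python
def partitions(replicas, links):
    """Group replicas into partitions, given pairs that can communicate."""
    parent = {r: r for r in replicas}

    def find(r):
        # Follow parent pointers to the representative, halving the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in links:
        parent[find(a)] = find(b)      # merge the two groups

    groups = {}
    for r in replicas:
        groups.setdefault(find(r), set()).add(r)
    return sorted(groups.values(), key=min)
```

For example, with links r1-r2, r3-r4, and r4-r5, replicas r3 and r5 end up in the same partition by transitivity even though no direct link between them is listed.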

Partitions  evolve  dynamically.  Initially,  there  is  one  partition  containing  all  the  tuple 
space  replicas.  The  initial  partition  may  be  divided  into  several  smaller  partitions.  The 
smaller partitions may then merge to form larger partitions, or further subdivide into even
smaller partitions. Figure 3.2 shows an example of the partition evolution process. The
initial partition (r1, r2, r3, r4, r5) becomes two partitions (r1, r2) and (r3, r4, r5) after
some failure at time t1. In this partition situation, r1 and r2 can communicate with each
other, and r3, r4, and r5 can communicate with each other, but none of the replicas in the
first partition can communicate with any of the replicas in the second partition. At some
time t2, two new partitions, (r1, r2, r3) and (r4, r5), are formed. Again, communication is
possible  among  the  replicas  in  the  first  partition  and  among  those  in  the  second  partition, 
but  there  is  no  possible  communication  between  a  replica  in  the  first  partition  and  a  replica 
in  the  second  partition. 

The view of a worker w is defined to be the set of replicas that w thinks it can access.
The view of a replica r is defined to be the set of replicas that r thinks it can access.¹ A
view always contains a majority of the replicas in the system (to be explained in Chapter 5).

Worker  and  replica  views  can  change  over  time.  Replicas  can  initiate  a  view  change 
algorithm  when  they  think  that  there  is  a  change  in  the  network  topology.  The  view  change 
algorithm  will  be  explained  in  more  detail  later.  For  now,  it  suffices  to  know  that  as  the 
result  of  a  view  change,  a  new  view  may  be  established  and  the  replicas  in  the  new  view 
will  agree  on  a  common  view. 

Associated with each view is a unique viewid. A viewid contains a sequence number n
and the replica id, r_id, of the replica that initiated the view. That is:

viewid = record[n : int, r_id : replica_id]

Viewids are totally ordered by the relation $<$:

$$id_i < id_j \equiv (id_i.n < id_j.n) \lor ((id_i.n = id_j.n) \land (id_i.r\_id \prec id_j.r\_id))$$

where $id_i$ and $id_j$ are viewids, $id_i.n$ and $id_j.n$ are the sequence numbers of $id_i$ and $id_j$,
respectively, and $id_i.r\_id$ and $id_j.r\_id$ are the replica ids of the replicas that initiated $id_i$ and $id_j$,
respectively.
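The ordering on viewids is a lexicographic comparison: first by sequence number, then by the id of the initiating replica. A minimal Python sketch follows; the names ViewId and precedes are invented for illustration, and replica ids are assumed to be integers so that the built-in < can stand in for the replica id ordering.

```python
from typing import NamedTuple


class ViewId(NamedTuple):
    n: int      # sequence number
    r_id: int   # id of the replica that initiated the view (int is an assumption)


def precedes(a, b):
    """True if viewid a is ordered before viewid b."""
    return a.n < b.n or (a.n == b.n and a.r_id < b.r_id)
```

Since ViewId is a named tuple, precedes(a, b) agrees with Python's own tuple comparison a < b; the explicit function is shown only to mirror the definition term by term.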


¹Views are referred to as virtual partitions in [1].


[Figure 3.2 panels: partitions at time t1; partitions at time t2]


Figure  3.2:  An 
Figure 3.3: Inconsistency Scenario One: Concurrent in operations extract the same tuple
from the replicas.

It  is  important  to  realize  that  views  and  partitions  are  different  concepts.  Partitions 
represent  the  physical  configurations  of  a  system  while  views  are  what  workers  and  replicas 
think the system configurations are. For instance, if r1, r2, r3, r4 and r5 are replicas of
some tuple space, and at some instant there are two partitions (r1, r4, r5) and (r2, r3),
then the views of r1, r4, and r5 may be {r1, r4, r5}, and those of r2 and r3 may be {r1,
r2, r3}. The inconsistencies between views and partitions arise for many reasons. One
possibility  is  that  changes  in  network  topology  happen  abruptly  and  replicas  and  workers 
cannot  detect  the  changes  instantly.  Another  possibility  is  that  lost  messages  may  change 
workers’  and  replicas’  views  of  the  “world”  even  when  no  physical  change  takes  place. 

3.2  Design  Goals 

The  design  of  our  network  Linda  kernel  is  driven  by  the  following  set  of  high-level  goals: 

•  Availability  —  The  tuple  space  should  have  a  high  probability  of  being  available 
despite  failures.  Our  goal  is  that  as  long  as  the  majority  of  replicas  (for  example, 
3  out  of  5  or  251  out  of  500)  can  communicate  with  each  other,  the  tuple  space  is 
available. 

•  Consistency  —  The  replicated  tuple  space  ought  to  present  a  consistent  state  to 
the  workers.  The  user  programs  should  not  be  aware  of  whether  the  tuple  space 
is replicated or not, except for the higher availability of a replicated tuple space.
Therefore, multiple copies of the tuple space should not cause any anomalies for out,
in, and rd operations. The result of these operations must be the same as if there were
only one copy of the tuple space available. For instance, concurrent in operations must
not extract the same tuple from different replicas (Figure 3.3 illustrates an anomaly
where the same tuple, (“x”, 1), on r1 and r2 is extracted by concurrent in operations
on w1 and w2), and the same in should not delete different tuples from different
replicas (the problem can be seen in Figure 3.4 where two different tuples on r1 and
r2 are extracted by the same in operation on w1).

Figure 3.4: Inconsistency Scenario Two: The same in operation extracts different tuples
from the replicas.

•  Efficiency  —  Operations  should  perform  efficiently  to  support  requirements  of  the 
parallel  programming  paradigm.  Except  for  satisfying  a  set  of  semantic  constraints 
(as  will  be  discussed  below),  no  delays  should  be  imposed  on  the  user  programs. 
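The availability goal rests on a strict majority test: 3 of 5 replicas suffice, as do 251 of 500, but 250 of 500 do not. A small Python sketch of the test is given below; the function name has_majority is an assumption made for the example.

```python
def has_majority(reachable, all_replicas):
    """True if reachable contains a strict majority of all_replicas."""
    everyone = set(all_replicas)
    return 2 * len(set(reachable) & everyone) > len(everyone)
```

Because two disjoint groups of replicas cannot both contain more than half of the system, at most one group can pass this test at any time.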

Having  enumerated  the  goals,  we  are  ready  to  give  an  overview  of  how  operations  are 
performed  in  a  network  Linda  kernel. 

3.3  General  Scheme  for  the  Operations 

The  principal  idea  behind  our  network  kernel  is  to  use  an  operations  protocol  in  conjunction 
with  the  view  change  algorithm.  This  section  gives  the  reader  an  overview  of  how  Linda 
operations  are  implemented  on  a  replicated  tuple  space.  We  assume  that  workers  do  not 
fail; we discuss a method to cope with worker failures in Section 6.3.2.

In  this  thesis,  we  assume  that  a  tuple  space  is  implemented  as  a  set  of  tuple  sets.  Each 
tuple  set  contains  all  the  tuples  with  the  same  logical  name.  There  is  a  lock  associated  with 
each tuple set.² When a tuple set is locked by a worker, further in operations of all other
workers  involving  that  tuple  set  are  blocked  until  the  lock  is  released  by  the  locking  worker. 

To  simplify  the  presentation,  we  do  not  concern  ourselves  with  view  changes  in  this 
section.  The  assumption  is  that  workers’  views  are  accurate  and  no  event  occurs  that 
would  invalidate  them.  This  assumption  allows  us  to  understand  the  operations  without 
getting  involved  in  the  details  of  the  view  change. 

3.3.1  Operations 

Let  w  be  a  worker  executing  the  operation.  The  three  operations  on  a  replicated  tuple  space 
are  implemented  as  follows: 

• Out(t) — The request to execute the operation is broadcast to all the replicas in w’s
view,  and  w  waits  for  acknowledgments  from  the  replicas. 

At  each  replica,  t  is  stored  into  the  local  copy  of  the  tuple  space,  and  an  acknowledg¬ 
ment  is  sent  to  w. 

If  w  does  not  receive  acknowledgments  from  all  the  replicas  in  its  view,  it  repeats  the 
request  until  all  the  acknowledgments  have  been  received.  It  is  replicas’  responsibility 
to  discard  redundant  requests  for  the  same  out. 

•  In(s)  —  This  is  done  in  two  phases: 

-  Phase  One  (ini)  —  W  sends  template  s  to  all  the  replicas  in  its  view. 

Each  replica  searches  its  local  copy  of  the  tuple  space  for  matching  tuples.  The 

tuple  set  for  tuples  with  s’s  logical  name  is  locked,  and  a  set  containing  all 

matching  tuples  is  returned  to  w.  If  there  is  no  matching  tuple,  an  empty  set 

2We  could  use  a  finer  grain  of  locking  in  which  we  lock  just  the  tuples  that  might  match  the  template; 
such  locks  are  known  as  predicate  locks  [1 1], 
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is  returned.  If  the  tuple  set  is  already  locked  by  another  worker,  w's  request  is 
refused. 

If all the replicas in the view respond, none of the replies is a refusal, and there
is a non-empty intersection of all the tuple sets w received, then an arbitrary
tuple in the intersection is selected, the actuals of the selected tuple are assigned
to the formals of s, and phase two starts.

If  all  the  replicas  in  the  view  have  not  responded  within  a  reasonable  time  or  if 
all  replicas  responded  and  the  intersection  is  empty,  phase  one  is  repeated  after 
a  timed  delay. 

If a majority of the replicas in w’s view refused w’s request, then w instructs the
replicas to release the locks, and phase one will be repeated after some random
time interval.

If a minority of the replicas refused, then w repeats the first phase until it gets
locks on all the replicas in its view.

- Phase Two (in2) — W informs all the replicas in the view about the selection in
phase one. The replicas remove the selected tuple from their copies of the tuple
space, release the locks set during the first phase, and send an acknowledgment
to w. An in2 is finished only when all the replicas have replied. Otherwise,
it is repeated until they have. Again, repeated requests for the same in2 are
discarded by the replicas.

It would be a violation of our consistency goal for an in to delete a different
matching tuple from each replica. Instead, the same tuple must be removed by
all the replicas in the view. The mission of in1 is to ensure that this constraint is
met. A selection can be made only when the executing worker has a lock on the
same tuple at every replica in its view; a non-empty intersection guarantees this
condition. No selection can be made if the intersection is empty; the worker must
be blocked until all the replicas have replied to the in1 request and a selection
is made.

The locks keep the tuples under consideration from being removed by other
concurrent in operations. If there are concurrent in1s concerning the same tuple
set, each might acquire locks at some replicas, and neither would be able to
complete. In other words, there would be a deadlock. To resolve such a situation,
we release locks when the worker has acquired them only at a minority of replicas;
this will enable a worker with a majority to succeed in acquiring locks at all
replicas. The case of several competing workers who repeatedly acquire only a
minority of locks can be avoided by introducing a random delay, so that workers
make their next attempts to set the lock at different times.

•  Rd(s)  —  Template  s  is  broadcast  to  all  the  replicas  in  w’s  view.  Each  replica  searches 
for  a  matching  tuple  in  its  local  copy  of  the  tuple  space.  If  a  matching  tuple  is  found, 
a  copy  of  it  is  sent  back  to  w.  Otherwise,  it  informs  w  that  no  matching  tuple  is 
found. 

Whenever  w  receives  a  tuple  from  any  of  the  replicas,  it  assigns  the  actuals  of  the 
returned  value  to  the  formals  of  s,  and  the  execution  continues.  Responses  from  the 
rest  of  the  replicas  are  ignored. 

If  no  tuple  is  received  within  a  reasonable  delay,  the  rd  is  repeated  until  one  is. 

Notice  that  a  modification  operation  (out  or  in)  is  complete  only  after  it  has  occurred 
at  all  replicas,  and  that  a  worker  continues  to  perform  the  operation  at  all  replicas  in  its 
current  view  until  it  knows  the  operation  is  complete. 
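
The decision a worker makes at the end of one round of phase one can be sketched in Python. This is an illustrative sketch only, not code from the thesis: the names decide_in1 and Phase1Outcome, the use of None to stand for a refusal, and the use of min() to pick the "arbitrary" tuple are assumptions made for the example. The text does not say what happens when exactly half of the replicas refuse; the sketch treats that case like a minority refusal.

```python
from enum import Enum


class Phase1Outcome(Enum):
    SELECT = "select"                # a tuple was chosen; start phase two
    RETRY = "retry"                  # repeat phase one after a timed delay
    UNLOCK_AND_BACKOFF = "unlock"    # release locks, retry after a random delay
    KEEP_TRYING = "keep_trying"      # repeat phase one until all locks are held


def decide_in1(view, replies):
    """Decide what worker w does after one round of phase one.

    view    -- non-empty set of replica ids in w's view
    replies -- dict mapping replica id to None (request refused) or to the
               set of matching tuples locked at that replica
    """
    if set(replies) != set(view):
        # Not every replica answered within the timeout.
        return Phase1Outcome.RETRY, None

    refusals = sum(1 for r in view if replies[r] is None)
    if 2 * refusals > len(view):
        return Phase1Outcome.UNLOCK_AND_BACKOFF, None
    if refusals > 0:
        # Assumption: an exact tie is handled like a minority refusal.
        return Phase1Outcome.KEEP_TRYING, None

    common = set.intersection(*(replies[r] for r in view))
    if not common:
        return Phase1Outcome.RETRY, None
    # Any element would do; min() only makes the sketch deterministic.
    return Phase1Outcome.SELECT, min(common)
```

With three replicas, a round in which every replica returns a set containing ("x", 1) selects that tuple, while a round in which two of the three refuse leads to releasing the locks and backing off.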

3.3.2  Properties  of  the  Operations 

From  the  basic  operations  scheme  stated  above,  we  can  see  that  an  out(t)  operation  does 
not  concern  itself  with  the  current  tuple  space  state.  It  simply  deposits  t  into  the  tuple 
space.  It  is  analogous  to  a  blind  write,  a  write  operation  that  does  not  read  the  value  of  the 
written object first. Therefore, there is no need for a worker issuing an out operation to
wait until the operation is finished. The execution of an out operation can be carried out
in the background while program execution continues.

There is no need for the executing worker to be blocked while an in2 is in process
because the in2 will not provide any information that is needed by the worker. Thus in2’s
can be completed in the background. Completing an in2 guarantees that the selected tuple
is removed from all the replicas in the view and the locks set by the corresponding in1’s
are released.

It  is  not  hard  to  see  that  the  worker  executing  a  rd  operation  must  be  blocked  until 
the  first  matching  tuple  is  returned  from  a  replica.  Similarly,  a  worker  executing  an  in 
operation  must  be  blocked  until  the  tuple  to  be  removed  is  selected. 

The  background  processing  of  out  and  in2  allows  multiple  operations  to  be  packaged 
in  one  message.  It  also  introduces  concurrency  between  running  a  worker  and  its  use  of  the 
tuple  space.  However,  the  executions  of  the  background  operations  need  to  satisfy  a  set  of 
constraints  that  ensure  the  Linda  semantics  are  preserved  in  the  face  of  concurrency.  For 
example,  if  we  do  not  control  concurrent  execution,  a  rd  operation  may  read  a  tuple  that 
was  supposed  to  be  removed  by  a  previous  in  operation  issued  by  the  same  worker  because 
the  background  in2  has  not  completed  by  the  time  the  rd  is  executed. 

3.4  Constraints  on  Operations 

To  determine  how  much  concurrency  we  can  achieve  without  violating  correctness,  we  need 
to  define  constraints  on  each  operation.  A  plausible  requirement  is  that  the  state  of  the 
tuple  space  observed  by  each  worker  does  not  conflict  with  what  it  has  done  or  observed  in 
the past.³ We let this requirement be our correctness criterion. We will first take a look at
the sequential constraints, the constraints on the operations of a single worker, and then
the inter-worker constraints, those imposed on the operations of different workers.

³ This requirement is known as one-copy serializability [6].

3.4.1 Sequential Constraints

This subsection investigates the constraints in an environment that has one worker and a
possibly replicated tuple space. Out and in2 are executed in the background concurrently.
Concurrent out’s will not cause problems. This is because both rd and in are nondeterministic
and blocking. Rd(s) can use any matching tuple in the tuple space at the moment.
If an out(t) was issued by the worker previously and t matches s, rd(s) may use t (if t is
already in the tuple space), or it will simply wait (if t has not yet been stored into the tuple
space and there are no other matching tuples in the tuple space) until t arrives. Similarly,
in1(s) can lock any matching tuple in the tuple space at the moment. It will wait for a
matching tuple to arrive (at all replicas) if there is not one already. Since rd is blocking,
no later out’s may start until the current rd operation has returned. Similarly, in’s will
block later out operations until in1 has returned.

Unfinished  in2’s  may  cause  problems  in  that  the  tuple  that  was  supposed  to  be  removed 
by  an  in  operation  may  still  be  in  the  tuple  space  when  a  later  rd  is  executed.  (A  later  in 
is  not  a  problem  because  the  locks  will  prevent  it  from  seeing  the  effects  of  the  earlier  in2  if 
both  concern  the  same  tuple  set.)  This  is  undesirable.  To  prevent  this  problem,  we  require 
that  the  operations  be  executed  at  each  replica  in  the  same  order  as  they  were  issued  by  the 
worker.  This  requirement  ensures  that  no  rd  can  be  executed  at  a  replica  before  a  previous 
in2  is  completed  at  that  replica. 

3.4.2 Inter-Worker Constraints

The  inter-worker  constraints  are  more  subtle  than  those  on  a  single  worker  because  different 
workers  run  in  parallel. 

For  example,  Figure  3.5  illustrates  the  kind  of  problem  that  can  arise.  It  shows  a  scenario 
where  there  are  two  workers  and  a  replicated  tuple  space.  There  is  at  most  one  tuple  (“x”, 
*),  where  *  is  an  integer,  in  the  tuple  space  at  any  time.  The  tuple  space  contains  tuple 
(“x”, 1) initially. Workers w1 and w2 are the only workers in the system, and are running
in parallel. X, u, and v are previously declared integer variables in the workers’ programs.
In this example, the integer value associated with tuples with logical name “x” increases
with time. In the figure, w1 modifies x in a way that satisfies this constraint; w2 reads x
and should not observe a violation of the constraint.

Figure 3.5: Inter-Worker Constraints

Forcing operations to be executed in order at each replica is not sufficient to enforce
the above constraint because rd can return a value from any replica. To illustrate this,
we use the same scenario above. Suppose the tuple space is replicated on r1 and r2, and
both contain (“x”, 1) at some point in time. Operations at w1 and w2 occur as follows:
w1’s in(“x”, formal x) and out(“x”, x + 1) are executed at r1, w2’s rd(“x”, formal u) is
executed at r1 and returns (“x”, 2), and finally, w2’s rd(“x”, formal v) is executed at r2
and returns (“x”, 1), which is incorrect.

To  remedy  the  problem  above,  we  require  that  requests  for  an  out  operation  not  be  sent 
to  any  replica  until  the  previous  in  operations  issued  by  the  same  worker  are  completed 
at  all  replicas  in  the  current  view.  Thus,  the  tuple  (“x”,  2)  cannot  exist  at  r2  until  (“x”, 
1) has been removed from both r1 and r2 in the above example. So when rd(“x”, formal
u)  returns  with  (“x”,  2)  (from  any  replica),  (“x”,  1)  has  already  been  removed  from  every 
replica. 
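
The anomaly and its remedy can be replayed with a small Python sketch. It models each replica as a plain set of tuples and fixes one particular interleaving; the function names rd and scenario and the use of None to mean "the rd would have to wait" are assumptions made for illustration, not part of the protocol.

```python
def rd(replica, name):
    """Return the value of a tuple with the given logical name held by this
    replica, or None if the rd would have to wait for one to arrive."""
    for logical_name, value in replica:
        if logical_name == name:
            return value
    return None


def scenario(enforce_constraint):
    """Replay the interleaving from the text on two replicas r1 and r2."""
    r1, r2 = {("x", 1)}, {("x", 1)}

    r1.remove(("x", 1))        # w1's in("x", formal x) executes at r1
    if enforce_constraint:
        # With the constraint, the in must also have completed at r2
        # before the out request is sent to any replica.
        r2.remove(("x", 1))
    r1.add(("x", 2))           # w1's out("x", x + 1) reaches r1 first

    u = rd(r1, "x")            # w2's first rd is served by r1
    v = rd(r2, "x")            # w2's second rd is served by r2
    return u, v
```

Without the constraint the sketch yields u = 2 followed by v = 1, the violation described above. With the constraint, the second rd finds no matching tuple at r2 and has to wait, so it can never observe the stale value.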

3.4.3  Summary 

The  sequential  and  inter-worker  constraints  are  summarized  as  follows: 

1.  The  operations  must  be  executed  at  each  replica  in  the  same  order  as  they  were  issued; 

2.  An  out  operation  must  not  start  until  all  previous  in  operations  issued  on  the  same 
worker  are  completed  at  all  replicas  in  the  worker’s  view. 


The second constraint is translated into “an out operation must not start until all previous
in2’s issued on the same worker are completed at all replicas in the worker’s view.” The
second constraint may cause a delay in the execution of the worker. The worker does not need to
wait for an out operation itself, but may be delayed at a subsequent rd or in. We expect that
often there will be no delay, however, because previous ins will be completed by the time
the rd or in is issued.

3.5  View  Change  Management 

The  failures  mentioned  in  subsection  3.1.2  affect  the  replicas  making  up  the  tuple  space. 
To mask these failures automatically and efficiently, and to preserve the single-image
appearance of the tuple space, views were introduced.

Intuitively,  a  view  reflects  the  changing  communication  capability  among  members  of 
a  partition.  When  the  communication  capability  inherent  in  a  view  is  believed  to  have 
changed,  the  replicas  switch  to  a  new  view  by  executing  the  view  change  algorithm;  our 
algorithm  is  a  variation  of  the  original  virtual  partitions  protocol  proposed  by  El  Abbadi, 
Skeen, and Cristian [1]. As part of a view change, the view change algorithm generates a
new  viewid  and  a  new  view.  The  viewid  of  the  new  view  is  guaranteed  to  be  greater  than 
the  viewid  of  any  earlier  view. 

In Figure 3.6, we illustrate what the view change algorithm achieves. The original
configuration of the tuple space is {r1, r2, r3, r4, r5}, and the initial view of these replicas
is {r1, r2, r3, r4, r5} with viewid < 2, r1 >. Now suppose a communication failure makes
it impossible for replica r1 to talk to the others. When this failure is noticed, the system
initiates a change in view. As a result of the view change, a new view {r2, r3, r4, r5} is
formed with viewid < 2, r5 >.

A new view can be formed only when it contains a majority of the replicas in the
original configuration. If this is impossible, the replicas remain in their old views. Thus, if
a modification operation (in1, in2, or out) is completed at all the replicas in a view, this
implies that at least a majority of the replicas know the effect of the operation. (Recall
that a modification operation is complete only when it has occurred at all replicas in the
worker’s current view.)

As  part  of  a  view  change,  the  algorithm  selects  an  initial  state  for  the  new  view;  all 
replicas  in  the  new  view  will  be  initialized  with  this  state.  The  chosen  state  is  the  state  of 
the  replica  in  the  new  view  whose  previous  viewid  is  greater  than  or  equal  to  the  previous 
viewids  of  all  other  replicas  in  the  new  view.  As  discussed  below,  this  guarantees  that  effects 
of  completed  operations  will  persist  into  all  later  views. 
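
The two rules just stated (a new view needs a majority of the original configuration, and its initial state comes from the member with the greatest previous viewid) can be sketched in Python. This is a simplified illustration, not the view change algorithm itself: the function name form_view, the representation of a viewid as a (counter, replica id) pair compared lexicographically, and the opaque state values are all assumptions of the sketch.

```python
def form_view(configuration, reachable):
    """Try to form a new view.

    configuration -- set of all replica ids in the original configuration
    reachable     -- dict mapping each reachable replica id to a pair
                     (previous_viewid, state); viewids are assumed to be
                     (counter, replica_id) tuples ordered lexicographically
    Returns (members, initial_state), or None when no majority is reachable.
    """
    members = set(reachable) & set(configuration)
    if 2 * len(members) <= len(configuration):
        return None  # not a majority: replicas stay in their old views

    # The state is taken from a member whose previous viewid is greatest.
    donor = max(members, key=lambda r: reachable[r][0])
    return members, reachable[donor][1]
```

For example, if r1 is cut off from a five-replica configuration, the remaining four replicas form a majority, and a replica that missed the most recent view is initialized from one that took part in it.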

3.6  Correctness 

The  correctness  of  our  algorithm  depends  on  the  interaction  of  operation  processing  and 
the  view  change  algorithm.  In  this  section,  we  discuss  the  conditions  that  must  be  met  for 
correct  operation. 

1.  The  operations  appear  to  happen  in  the  correct  order. 

This  condition  is  guaranteed  by  the  two  constraints  summarized  in  subsection  3.4.3: 
the  operations  are  executed  at  each  replica  in  the  order  they  are  issued,  and  all  in 
operations  for  a  particular  worker  must  be  completed  at  all  replicas  in  the  current 
view  before  an  out  operation  for  that  worker  starts. 

2.  Completed  modification  operations  occur  at  all  replicas  in  some  view. 

This is guaranteed by the operations protocol. Both in and out operations are completed
only when their effects occur at all the replicas in the executing worker’s view.

3.  The  effects  of  completed  operations  survive  into  all  subsequent  views. 

This  is  guaranteed  by  the  view  change  algorithm.  If  the  previous  view  contained  a 
majority  of  replicas,  and  the  new  view  also  consists  of  a  majority,  then  both  views 
must  have  at  least  one  replica  in  common  that  was  in  the  previous  view  and  is  now  in 
the  new  view.  The  state  of  the  new  view  is  taken  from  such  a  replica.  Therefore,  the 
new view starts out knowing what happened in the previous view. Since the effects of
completed  operations  are  known  at  all  replicas  in  the  old  view,  the  effects  of  completed 
operations  survive  into  all  subsequent  views. 
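
The counting argument behind condition 3, that any two majorities of the same configuration share at least one replica, can be checked exhaustively for a small configuration. The Python sketch below does this by brute force; the helper names majorities and every_pair_intersects are chosen for the example and are not from the thesis.

```python
from itertools import combinations


def majorities(replicas):
    """Return every subset of the configuration that is a strict majority."""
    n = len(replicas)
    ordered = sorted(replicas)
    return [
        set(subset)
        for size in range(n // 2 + 1, n + 1)
        for subset in combinations(ordered, size)
    ]


def every_pair_intersects(replicas):
    """True when any two majority subsets have a replica in common."""
    subsets = majorities(replicas)
    return all(a & b for a in subsets for b in subsets)
```

For a five-replica configuration there are sixteen majority subsets, and every pair of them overlaps, which is what allows the state of a completed operation to be carried from one view to the next.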


Chapter  4 

Operations  Protocol 


The  execution  of  the  operations  protocol  requires  the  cooperation  of  both  the  workers  and 
the  replicas.  When  a  tuple  space  operation  out,  rd,  or  in  is  encountered  by  a  worker,  a 
request  for  the  operation  is  formed  at  the  worker.  Periodically,  the  requests  are  sent  to 
each  replica  in  the  worker’s  view,  and  are  executed  by  the  replica.  After  the  execution,  the 
replica  sends  back  either  a  result  (if  there  is  one)  or  a  completion  acknowledgment. 

The  messages  that  contain  the  requests  or  answers  can  be  lost,  delayed,  or  duplicated 
by  the  network.  When  a  worker  does  not  receive  all  the  replies  within  an  expected  time 
interval,  it  repeatedly  sends  the  requests  until  it  gets  the  replies  back  from  all  the  replicas 
in  its  view.  This  method  solves  the  problems  of  lost  and  delayed  messages,  but  not  of 
duplicate  messages  (in  fact,  it  generates  duplicate  messages).  A  remedy  to  this  problem  is 
included  in  the  operations  protocol. 

The  next  section  discusses  the  means  of  communication  among  workers  and  replicas. 
Section 4.2 explains a worker’s participation in the operations protocol. The related activities
on a replica are described in section 4.3. The operations protocol is summarized in
section  4.4. 

4.1  Communication  Among  Workers  and  Replicas 

Communication is accomplished by sending and receiving messages using the send and
receive statements. This section describes these statements, and the contents of messages
exchanged between workers and replicas.

4.1.1 Send and Receive

The form of a send statement is

send(message_type, parm_list) to destination

where message_type is a string indicating the type of message sent, parm_list is a list of
parameters containing the information to be sent, and destination is the id of the receiver,
either a replica or a worker, of the message. As an example,

send(“abc”, my_id) to r_id

will send a message of type “abc” to the replica r_id. The parameter is my_id.

Messages  are  received  using  the  receive  statement.  An  example  is  the  following: 

receive
    foo(x: int): S1
    bar(a: char, b: string): S2
end.

If a message with a name matching one of those listed in an arm is waiting for the process
executing the receive, it is selected and control continues at the statement in the matched
arm. If there are several matching messages, one is selected nondeterministically. If there
are no matching messages, the process waits until one arrives.

A  second  form  of  the  receive  statement  allows  the  process  to  wait  until  a  timeout 
expires.  For  example, 

receive until t
    foo(x: int): S1
    bar(a: char, b: string): S2
end except when timeout: ... end.

If t = 0, this statement is identical to that above. Otherwise, the process waits for a matching
message only so long as the time of the clock at its node is less than or equal to t; when
its local time is greater than t, the statement terminates immediately with the timeout
exception.
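
The behavior of the two receive forms can be approximated in Python with a condition variable. This is a rough sketch under stated assumptions, not the thesis's message system: the class Mailbox, the exception ReceiveTimeout, and the convention that until=None means "wait indefinitely" are inventions of the example, and the sketch always takes the oldest matching message rather than choosing nondeterministically.

```python
import threading
import time


class ReceiveTimeout(Exception):
    """Raised when no matching message arrives before the deadline."""


class Mailbox:
    def __init__(self):
        self._messages = []                  # queued (name, args) pairs
        self._cond = threading.Condition()

    def send(self, name, *args):
        with self._cond:
            self._messages.append((name, args))
            self._cond.notify_all()

    def _take_matching(self, arms):
        # Remove and return the oldest message whose name has an arm.
        for index, (name, args) in enumerate(self._messages):
            if name in arms:
                del self._messages[index]
                return name, args
        return None

    def receive(self, arms, until=None):
        """arms maps message names to handlers; until is an absolute
        time.monotonic() deadline, or None to wait without a timeout."""
        with self._cond:
            while True:
                match = self._take_matching(arms)
                if match is not None:
                    break
                remaining = None if until is None else until - time.monotonic()
                if remaining is not None and remaining <= 0:
                    raise ReceiveTimeout()
                self._cond.wait(remaining)
        name, args = match
        return arms[name](*args)
```

Messages whose names match no arm stay queued, so a later receive with different arms can still pick them up.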


4.1.2  Contents  of  the  Messages 

This  subsection  describes  the  contents  of  the  messages  transmitted  between  workers  and 
replicas. 

A  message  from  a  worker  to  a  replica  is  typically  a  request  to  execute  a  list  of  tuple 
space  operations.  In  addition  to  the  information  needed  to  execute  these  operations,  the 
parameters in such a message contain the worker’s current viewid and the message’s
unique id, the mid.

The  viewid  in  the  message  is  compared  at  the  receiving  replica  with  the  replica’s  viewid. 
If  the  worker  and  the  replica  have  the  same  viewid,  the  requests  are  executed  at  the  replica. 
Otherwise,  if  the  replica  has  a  more  recent  view,  the  worker  is  informed  about  the  new 
view,  and  no  operations  are  executed.  If  the  replica  has  an  old  view,  the  worker’s  message 
is  ignored. 

The  mid  is  used  to  weed  out  the  duplicates  and  outdated  replies.  It  is  generated  by 
the  worker  each  time  a  message  is  sent.  When  a  replica  receives  a  message  with  an  mid 
already  seen  before,  the  message  is  a  duplicate,  and  is  ignored.  When  a  worker’s  request 
is  completed,  the  replica  sends  back  the  result  along  with  the  mid  received  in  the  request. 
The  mid  received  at  the  worker’s  side  can  be  used  to  decide  whether  the  reply  is  for  the 
request  just  sent.  Outdated  replies  (the  replies  with  old  mids)  are  weeded  out. 
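
The replica-side checks described in this subsection can be sketched in Python as follows. The sketch is an assumption-laden illustration rather than the thesis's replica code: the class Replica, the method handle, the reply shapes, and the decision to key duplicate detection on the pair (worker id, mid) are all choices made for the example, and the set of seen mids is never pruned.

```python
class Replica:
    def __init__(self, viewid, view):
        self.viewid = viewid      # assumed comparable, e.g. (counter, id)
        self.view = view          # set of replica ids in this view
        self.seen = set()         # (worker_id, mid) pairs already handled
        self.executed = []        # operations executed so far, in order

    def handle(self, worker_id, mid, worker_viewid, ops):
        """Process one request message and return the reply, or None
        when the message is ignored."""
        if worker_viewid < self.viewid:
            # The replica has a more recent view: inform the worker and
            # execute nothing.
            return ("newview", self.viewid, self.view)
        if worker_viewid > self.viewid:
            # The replica's own view is old: ignore the message.
            return None
        if (worker_id, mid) in self.seen:
            # A message with this mid was already seen: it is a duplicate.
            return None
        self.seen.add((worker_id, mid))
        self.executed.extend(ops)
        # The mid is echoed so the worker can discard outdated replies.
        return ("ack", mid)
```

On the worker's side, a reply is accepted only if the echoed mid equals the mid of the request most recently sent.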

4.2  Processing  On  a  Worker 

The  last  chapter  explained  that  out  and  in2  (the  second  phase  of  in)  operations  can  be 
non-blocking  —  the  program  process  does  not  have  to  wait  until  the  results  of  the  operations 
come  back.  In  other  words,  the  processing  of  out  and  in2  operations  can  be  done  by  some 
background  process.  This  section  introduces  the  notion  of  the  foreground  and  background 
processes.  Each  worker  contains  a  foreground  process  and  a  background  process.  The  two 
processes  communicate  via  a  shared  data  structure  called  the  operations  log.  The  subsequent 
subsections explain the function of these components.

Figure 4.1: Replicas and Internals of a Worker


4.2.1  The  Components  of  a  Worker 

Figure 4.1 illustrates the internals of a worker A and its relationship with the replicas (for
example, R1, R2, and R3). There are three major components of a worker: a foreground
process (FG), a background process (BG), and an operations log that includes a request
queue.

FG and BG communicate through the operations log. FG executes the program including
its accesses of the tuple space. It stores requests in the operations log. BG retrieves
requests from the log, communicates with the replicas to carry them out, and stores results
in the log. The requests for the operations that do not have results (out and in2) are
removed from the operations log by BG after they are finished. When a result is expected
(as in rd, in1, or unlock), BG updates the request entry on the operations log with the
result after the replies from the replicas are received. The result is picked up and the entry
is removed by FG before it continues its execution.

4.2.2  Operations  Log 

The operations log of each worker synchronizes both FG and BG, and records the requests
and answers. FG and BG can add, remove and update the requests on the operations
log by calling one of the operations provided by ops_log, the operations-log data type shown
in Figure 4.2. The internal representation of an ops_log is completely hidden from FG and
BG.

The log contains five kinds of requests: rd, out, in1, unlock, and in2. The latter
three requests are used to carry out an in operation: in1 does phase one, unlock releases
locks when this is necessary, and in2 requests are used to do phase two. At any time, the
log contains the most recent request, possibly preceded by some requests that are executed
in the background (out and in2). Requests are processed when they are ready. An out
request is ready provided all earlier in2s are completed; other requests are ready if all earlier
out requests are ready.
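
The readiness rule can be sketched in Python as a scan over the log. The sketch makes two assumptions that the text does not state in this form: completed in2 requests have already been removed from the log, so any in2 still present counts as not completed, and the ready requests are returned as a prefix of the log so that the issue order is preserved. The function name ready_requests and the representation of requests as plain strings are illustrative only.

```python
def ready_requests(log):
    """Return the longest prefix of the log whose requests are all ready.

    log -- list of request kinds in issue order, oldest first, e.g.
           ["out", "in2", "out", "rd"]
    """
    ready = []
    pending_in2 = False
    for kind in log:
        if kind == "out" and pending_in2:
            # This out must wait for an earlier in2, and every later
            # request must wait for this out, so the scan stops here.
            break
        ready.append(kind)
        if kind == "in2":
            pending_in2 = True
    return ready
```

For the log ["in2", "out", "rd"] only the in2 is ready, whereas for ["out", "out", "in2", "rd"] all four requests are ready because no out follows the in2.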

An operations log can be created by means of the new operation. FG calls out, rd, and
in to add out, rd, or in1 requests, respectively. The remaining operations are called by
BG. The result of a rd request or an in1 request can be delivered using rd_ans or in1_ans.
The out request does not have a result. The completed requests can be removed from the
operations log via the finished operation. A list of outstanding requests in the operations
log can be obtained by calling get_ops.

Get_ops returns a list of ready operation requests; the list contains the requests in order.
Figure 4.3 shows the format of these requests. Rd_op contains the template. Out_op contains
t (the tuple to be stored in the tuple space) and t_stamp (the timestamp of the operation, to
be explained later). In1_op contains the template, and in2_op contains the template s (whose
matching tuples in the tuple space need be unlocked), t (the tuple to be deleted from the
tuple space), and t_stamp (the timestamp of the operation). Finally, unlock_op contains the
template whose matching tuples are to be unlocked.

ops_log = abstract data type providing operations new, rd, rd_ans, out,
          in, in1_ans, finished, get_ops

% Ops_log is a queue where requests for rd, out, and in are added to the
% top, and the finished requests are removed from the bottom or the top.

new = proc() returns(ops_log)
    Return a new, empty operations log.

get_ops = proc(ol: ops_log) returns(ops)   % Ops is defined in Figure 4.3.
    If the operations log ol is not empty, return the operations in the log.
    Otherwise, wait until ol is not empty and then return the operations.

out = proc(t: tuple, ol: ops_log)
    Form an out request and add it to the operations log ol.

rd = proc(s: tuple, ol: ops_log) returns(tuple)
    Form a rd request and add it to ol. Return with the result (a matching
    tuple to s) of the rd; at this point the rd request has been removed from ol.

rd_ans = proc(t: tuple, ol: ops_log)
    Deliver a rd answer t to the rd request entry on ol.

in = proc(s: tuple, ol: ops_log) returns(tuple)
    Form an in1 request and add it to ol. Return a copy of the selected tuple
    matching s; at this point all other matching tuples are locked. An in2
    request is formed and added to ol before returning.

in1_ans = proc(lock_set, cur_view: replica_set, t_set: tuple_set, ol: ops_log)
    Deliver the in1 answer to the in1 request entry on ol.
    Lock_set is a set of replicas having locks. Cur_view is the worker’s current view.
    T_set is a set of tuples locked at all the replicas.

unlock_ans = proc(ol: ops_log)
    Inform the unlock entry on ol about its completion.

finished = proc(k: int, ol: ops_log)
    Remove the first k requests from ol, and k > 0.

Figure 4.2: Specification for Operations Log

ops = array[op]

op = oneof[rd: tuple, out: out_op, in1: tuple, in2: in2_op, unlock: tuple]

out_op = record[t: tuple, t_stamp: int]

in2_op = record[s: tuple, t: tuple, t_stamp: int]


Figure 4.3: Ops Type


cur_view: view         % Initial value = set of all replicas
cur_viewid: viewid     % Initial value = < 0, my_id >
mid: int               % Message id, initial value = 0
my_id: worker_id       % Worker’s id
ol: ops_log            % Initial value = ops_log$new()

where
    view = replica_set

Figure 4.4: State of a Worker

There is at most one rd, in1 or unlock request in the operations log at any given
moment. This is because these operations block FG from further processing until the
results or completion acknowledgments are received. The completed rd, in1, or unlock
request is deleted from the operations log before FG continues its execution.
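
A fragment of the operations log can be sketched in Python to show how the log synchronizes FG and BG. The sketch covers only out, rd, rd_ans, get_ops, and finished; it leaves out in, in1_ans, unlock_ans, timestamps, and the readiness rule, and it returns every entry from get_ops. The class name OpsLog, the dictionary layout of an entry, and the use of threading.Condition are assumptions of the example, not the representation used in the thesis.

```python
import threading


class OpsLog:
    def __init__(self):
        self._cond = threading.Condition()
        self._entries = []                   # oldest request first

    def out(self, t):
        """Called by FG: add an out request and return at once."""
        with self._cond:
            self._entries.append({"kind": "out", "tuple": t})
            self._cond.notify_all()

    def rd(self, s):
        """Called by FG: add a rd request and block until BG answers it."""
        entry = {"kind": "rd", "template": s, "answer": None}
        with self._cond:
            self._entries.append(entry)
            self._cond.notify_all()
            self._cond.wait_for(lambda: entry["answer"] is not None)
            # FG removes its own answered entry before continuing.
            self._entries = [e for e in self._entries if e is not entry]
            return entry["answer"]

    def get_ops(self):
        """Called by BG: wait until the log is non-empty, then return
        copies of its requests in order."""
        with self._cond:
            self._cond.wait_for(lambda: len(self._entries) > 0)
            return [dict(e) for e in self._entries]

    def rd_ans(self, t):
        """Called by BG: deliver answer t to the pending rd request."""
        with self._cond:
            for e in self._entries:
                if e["kind"] == "rd" and e["answer"] is None:
                    e["answer"] = t
                    self._cond.notify_all()
                    return

    def finished(self, k):
        """Called by BG: remove the first k requests from the log."""
        with self._cond:
            del self._entries[:k]
```

In this sketch FG blocks inside rd while BG, running in another thread, removes the completed background requests with finished and then wakes FG with rd_ans.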

4.2.3  Worker  State 

Both  FG  and  BG  of  a  worker  can  change  the  worker’s  state.  The  state  of  a  worker  is 
summarized in Figure 4.4. Cur_view contains the set of replicas in the worker’s current view.
It always contains a majority of replicas in the system. No attempt is made to communicate
with the replicas outside of cur_view. The variables in Figure 4.4 are initialized to their
initial values before a program starts.


4.2.4  FG  Processing 

FG of a worker carries out the program processing. Whenever FG encounters an out,
rd, or in, it invokes the corresponding procedure shown in Figure 4.5. These procedures
interact with the operations log by adding the requests and picking up the results.

out = proc(t: tuple)
    ops_log$out(t, ol)
end out

rd = proc(s: tuple)
    % s is mutated so that its formals are assigned the actuals of a matching tuple.
    t: tuple := ops_log$rd(s, ol)
    tuple$assign(s, t)   % Assign the actuals of t to the formals of s.
end rd

in = proc(s: tuple)
    % s is mutated so that its formals are assigned some values.
    t: tuple := ops_log$in(s, ol)
    tuple$assign(s, t)   % Assign the actuals of t to the formals of s.
end in


Figure 4.5: Out, Rd, and In Procedures


4.2.5  BG  Processing 

BG  actively  checks  if  there  are  outstanding  operation  requests  on  the  operations  log.  If  so, 
it  sends  a  copy  of  the  operations  to  all  the  replicas  in  the  worker’s  view,  and  waits  until 
it  is  informed  that  the  operations  have  been  executed  at  all  the  replicas.  When  a  list  of 
operations  is  sent  to  a  replica,  it  is  guaranteed  that  the  order  of  the  operations  remains 
the  same  during  the  transmission.  At  the  replica,  the  operations  are  executed  in  the  same 
order. 

The worker’s cur_viewid is piggybacked on the operations list. If the worker’s view is
the same as the replica’s, the operations are executed, and their completion and results (if
any) are acknowledged by the replica. If the worker’s view is more recent than the
replica’s, the operations are ignored. If the worker’s view is old, the operations are ignored,
and BG is informed about the new view. Whenever BG receives a new view, it updates
cur_view and cur_viewid of its worker.

If, within a reasonable delay, BG does not receive acknowledgments from all the replicas
in its cur_view for the operations sent, the same message is repeated (with a new mid) until
all the replies are received.
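
The retry behavior can be paraphrased in a few lines of Python. The sketch replaces real message passing with a caller-supplied function and is only meant to show that each repetition carries a fresh mid and that a round succeeds only when every replica in the view acknowledges that mid. The names send_until_acknowledged and deliver, and the max_rounds bound (the protocol itself repeats indefinitely), are assumptions of the example.

```python
def send_until_acknowledged(view, deliver, first_mid=1, max_rounds=100):
    """Repeat a request until every replica in the view acknowledges it.

    view    -- set of replica ids in the worker's cur_view
    deliver -- hypothetical stand-in for one send/receive exchange:
               deliver(replica, mid) returns the mid echoed by the replica,
               or None if the request or its reply was lost
    Returns the mid of the round that was fully acknowledged.
    """
    mid = first_mid
    for _ in range(max_rounds):
        # Replies carrying any other mid are outdated and do not count.
        replied = {r for r in view if deliver(r, mid) == mid}
        if replied == set(view):
            return mid
        mid += 1   # the repeated message is sent with a new mid
    raise RuntimeError("no round was acknowledged by every replica")
```

If one replica misses the first round and another misses the second, the request succeeds on the third round, with the mid advanced twice.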
while true do

    mid := mid + 1
    ops_list: ops := ops_log$get_ops(ol)
    k: int := ops$size(ops_list)   % The number of operations in ops_list.

    % The following four variables keep track of reply information to various requests.
    r_set: replica_set := {}   % The set of replicas that have replied.
    in1ans: tuple_set := tuple_set$all()   % Set containing all tuples in the tuple space.
    lock_set: replica_set := {}   % The set of replicas that have locks for in1.
    returned?: bool := false   % Indicating if a result has been delivered to a rd request.

    for r: replica in cur_view do
        send(“ops”, ops_list, mid, my_id, cur_viewid) to r
    end % for

    t1: int := current_time() + δ1

    while true do
        receive until t1
            tag rd_ans(m: int, rr: replica, found?: bool, t: tuple):
                if m ≠ mid then
                    continue   % continue to the next iteration of inner while loop.
                end % if
                if found? & ¬returned? then
                    ops_log$rd_ans(t, ol)
                    if k = 1 then break   % exit inner while loop
                    else returned? := true
                    end % if
                end % if
                r_set := r_set ∪ {rr}
                if |r_set| = |cur_view| then
                    ops_log$finished(k - 1, ol)
                    break
                end % if

            tag in1_ans(m: int, rr: replica, locked?: bool, t_set: tuple_set):
                if m ≠ mid then continue end % if
                if locked? then
                    lock_set := lock_set ∪ {rr}
                    in1ans := in1ans ∩ t_set
                end % if
                r_set := r_set ∪ {rr}
                if |r_set| = |cur_view| then
                    ops_log$finished(k - 1, ol)
                    ops_log$in1_ans(lock_set, cur_view, in1ans, ol)
                    break
                end % if


Figure 4.6: BG Routine Part I
            tag unlock_ans(m: int, rr: replica):
                if m ≠ mid then continue end % if
                r_set := r_set ∪ {rr}
                if |r_set| = |cur_view| then
                    ops_log$unlock_ans(ol)
                    break
                end % if

            tag in2(m: int, rr: replica):
                if m ≠ mid then continue end % if
                r_set := r_set ∪ {rr}
                if |r_set| = |cur_view| then
                    ops_log$finished(k, ol)
                    break
                end % if

            tag out(m: int, rr: replica):
                if m ≠ mid then continue end % if
                r_set := r_set ∪ {rr}
                if |r_set| = |cur_view| then
                    ops_log$finished(k, ol)
                    break
                end % if

            tag newview(#: viewid, t_view: view):
                if # > cur_viewid then
                    cur_view := t_view
                    cur_viewid := #
                    break   % continue to the outer loop
                end % if
        end % receive
            except when timeout: break end % except
    end % while
end % while

Figure 4.7: BG Routine Part II
The BG routine is shown in Figures 4.6 and 4.7. (In the code, a break statement causes
an exit from the smallest containing loop; a continue statement causes control to continue
with the next iteration of the smallest containing loop.)

A completion acknowledgment or a result received by a worker from a replica corresponds
to the last operation in the operations list sent. It is also an indication that all previous
operations have been completed at that replica. Recall that if a rd, an in1, or an unlock
is present in the operations log, it must be the last entry in the list. There might be any
number of out operations in the list. Only one in2 entry is possible at any given time since
the completion of an in1 operation implies that all previous operations (including in2’s)
are completed.

For a rd answer, the first matching tuple returned (from any replica) is used to update
the rd request entry in the operations log. If the rd is the only request in the operations log,
the replies from all other replicas are ignored. Otherwise, BG has to wait until the replies
from all the replicas in its cur_view are received, though only the first matching tuple is
used in the result of the rd operation. This is because all previous operations (out’s or an
in2 or both) must be completed on all replicas in cur_view before the requests are removed
from the operations log.
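
The rd case can be sketched in Python as below. This is an illustration under assumptions, not the BG routine itself: replies are given as a list in arrival order, each reply is a hypothetical triple (replica id, found?, tuple), and k is the number of operations in the list that was sent, with the rd last.

```python
from typing import Iterable, Optional, Tuple

Reply = Tuple[str, bool, Optional[tuple]]  # (replica id, found?, tuple)


def collect_rd(replies: Iterable[Reply], view_size: int, k: int):
    """Process rd replies in arrival order.

    Returns (result, replies_consumed, all_replied).
    """
    result = None
    returned = False
    seen = set()
    consumed = 0
    for replica, found, t in replies:
        consumed += 1
        if found and not returned:
            result, returned = t, True           # first matching tuple wins
            if k == 1:
                return result, consumed, False   # rd alone: later replies are ignored
        seen.add(replica)
        if len(seen) == view_size:
            return result, consumed, True        # earlier out/in2 entries are done everywhere
    return result, consumed, False
```

When k is greater than 1 the function keeps consuming replies after the first match, mirroring the requirement that the earlier out and in2 entries be acknowledged by every replica in the view.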

For an in1 answer, BG must receive replies from all the replicas in the view in order to
make the decision about which tuple to remove from the tuple space. Once all the replies
are received, the previous requests are removed from the operations log.

When an in1 cannot get the locks on a majority of the replicas in the view, the worker
tries to release the locks by replacing the in1 entry in the operations log by an unlock
entry. Unlock must be the only entry in the operations log since all the previous requests
are removed by in1. Therefore, when the replies from all the replicas in the view are received
for an unlock entry, there is no need to remove any more requests from the operations log
other than the unlock request itself.
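
The decision taken once all in1 replies have arrived can be summarized by the following Python sketch. It is a simplified illustration: the function name decide_in1 and the string labels are hypothetical, a refused lock is represented by None, and "majority" is assumed to mean strictly more than half of the view.

```python
from typing import Dict, Optional, Set


def decide_in1(replies: Dict[str, Optional[Set[tuple]]], view: Set[str]):
    """Combine in1 replies from every replica in the view.

    replies maps a replica id to the tuple set it locked, or None if it refused.
    Returns ("in2", tuple), ("retry", None) or ("unlock", None).
    """
    lock_set = {r for r, ts in replies.items() if ts is not None}
    common = None
    for r in lock_set:                       # intersection of the locked tuple sets
        common = set(replies[r]) if common is None else common & replies[r]
    if lock_set == view:
        if common:
            return ("in2", min(common))      # any common tuple will do; min is deterministic
        return ("retry", None)               # all locked, but no common tuple yet
    if 2 * len(lock_set) > len(view):
        return ("retry", None)               # majority holds locks: repeat the in1
    return ("unlock", None)                  # no majority: release locks, then retry
```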

For an out or in2 request, when BG receives replies from all the replicas in cur_view,
the completed requests can be removed from the operations log.

When BG receives a new view message, it updates the local view and viewid if
the viewid in the message is more recent. Otherwise, the new view message is ignored.

reqs = array[req]
req = oneof[rd: rd_req, out: out_req, in1: in1_req, in2: in2_req, unlock: unlock_req]
rd_req = record[s: tuple, t: tuple]
out_req = record[t: tuple, t_stamp: int]
in1_req = record[s: tuple, t_set: tuple_set, all?: bool, maj?: bool]
in2_req = record[s: tuple, t: tuple, t_stamp: int]
unlock_req = record[s: tuple]

Figure 4.8: Request Queue Type

If  not  all  the  replicas  have  responded  to  the  requests  within  a  reasonable  time,  the 
requests  in  the  operations  log  are  sent  to  the  replicas  again,  and  the  whole  process  is 
repeated. 

Note that the log can contain the following requests. An unlock is always the only
request in the log. Otherwise, there can be zero or one in2 requests, followed by zero or
more out requests, followed by a single rd or in1. If the log contains an in2 followed by an
out, the out and all requests that follow it are not ready; otherwise, all requests are ready.
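
The shape of the log and the readiness rule can be checked with a short Python sketch. It models the log as a list of request kinds, which is a simplification, and it treats the trailing rd or in1 as optional (an assumption, made because out returns without waiting); the function names are hypothetical.

```python
from typing import List


def well_formed(log: List[str]) -> bool:
    """Check the log shape: a lone unlock, or [in2] out* [rd | in1]."""
    if log == ["unlock"]:
        return True
    rest = list(log)
    if rest and rest[0] == "in2":
        rest = rest[1:]                  # at most one leading in2
    while rest and rest[0] == "out":
        rest = rest[1:]                  # any number of out requests
    return rest in ([], ["rd"], ["in1"])


def ready_prefix(log: List[str]) -> List[str]:
    """Return the requests that may be sent to the replicas now."""
    if len(log) >= 2 and log[0] == "in2" and log[1] == "out":
        return log[:1]                   # the out and everything after it must wait
    return list(log)
```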

4.2.6  Implementing  the  Operations  Log 

This  section  describes  the  implementation  of  the  operations  log  specified  in  Figure  4.2.  In 
addition  to  synchronizing  FG  and  BG  and  recording  requests  and  answers,  the  operations 
log  also  assigns  timestamps  to  requests  that  need  them.  The  importance  of  the  timestamps 
will  be  discussed  in  the  next  section. 

An operations log consists of a request queue, a timestamp generator, two boolean flags,
and two tickets. The request queue, reqs, is an array of requests. The format of the requests
is shown in Figure 4.8. A request is enqueued by calling addh, which appends the request
at the back of the array; a request is dequeued by calling reml or remh; these operations
remove an entry from the front or the end of the array, respectively. The array operations
addh and reml are indivisible, that is, no other operations can be executed on the array
when addh and reml are in progress. This keeps the queue from being updated by both
FG and BG simultaneously.

ticket = abstract data type providing operations init, await_ge, await, dec, inc

    % Ticket is a mutable container of an integer.

    init = proc() returns(ticket)
        Return a new ticket containing zero.

    await = proc(t: ticket, n: int)
        Return when the ticket t contains n.

    await_ge = proc(t: ticket, n: int)
        Return when the ticket t contains a value greater than or equal to n.

    dec = proc(t: ticket, n: int)
        Reduce t by n. Dec is indivisible.

    inc = proc(t: ticket, n: int)
        Increase t by n. Inc is indivisible.

end ticket

Figure 4.9: Specification for Ticket

The  timestamp  generator  timestamp  is  implemented  as  an  integer  counter  that  assigns 
a  new  timestamp  to  a  request  to  be  enqueued  when  needed.  A  new  timestamp  is  generated 
by  incrementing  the  integer. 

The flags are used to determine when requests are ready. Flag in2? is true whenever
there is an in2 request in the log; inout? is true if an out request follows this in2 request.

The  tickets  #reqs  and  #ans  are  used  to  keep  track  of  the  number  of  outstanding 
requests  in  the  queue  and  the  number  of  outstanding  answers. 

Tickets are specified in Figure 4.9. They provide operations to increment and decrement
their values, and also to allow a process to wait for a ticket to have a specified value. Tickets
allow FG and BG to synchronize with one another; for example, BG can wait until #reqs
contains a value greater than or equal to 1.
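
One possible realization of the ticket specification is sketched below in Python, using a condition variable. This is an assumption about how a ticket could be built, not the implementation used in the thesis; the operation await is renamed await_eq because await is a reserved word in Python, and value is a hypothetical helper added for inspection.

```python
import threading


class Ticket:
    """Mutable integer container with blocking waits (after Figure 4.9)."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def inc(self, n: int) -> None:
        with self._cond:                 # indivisible update
            self._value += n
            self._cond.notify_all()      # wake any waiter whose condition may now hold

    def dec(self, n: int) -> None:
        with self._cond:
            self._value -= n
            self._cond.notify_all()

    def await_ge(self, n: int) -> None:
        with self._cond:
            self._cond.wait_for(lambda: self._value >= n)

    def await_eq(self, n: int) -> None:
        with self._cond:
            self._cond.wait_for(lambda: self._value == n)

    def value(self) -> int:
        with self._cond:
            return self._value
```

A waiter blocked in await_ge is released as soon as another thread calls inc with a large enough amount, which is the pattern FG and BG rely on.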



ops_log = cluster is new, rd, rd_ans, out, in, in1_ans, unlock_ans, finished,
          get_ops

    rep = record[request_queue: reqs, timestamp: int, in2?, inout?: bool,
                 #ans, #reqs: ticket]

    new = proc() returns(cvt)
        return(rep${request_queue: reqs$new(), timestamp: 0, in2?: false,
                    inout?: false, #ans, #reqs: ticket$init()})
    end new

    get_ops = proc(ol: cvt) returns(ops)
        % If ol.request_queue is not empty, return all ready request entries. Otherwise,
        % wait until ol.request_queue is not empty.
        ticket$await_ge(ol.#reqs, 1)
        temp_ops: ops := ops$new()
        for request: req in reqs$elements(ol.request_queue) do
            % req2op returns the corresponding op of req
            ops$addh(temp_ops, req2op(request))
            if ol.inout? then return(temp_ops) end   % just return first element in this case
        end % for
        return(temp_ops)
    end get_ops

    out = proc(t: tuple, ol: cvt)
        % Log the out request on ol.
        ol.timestamp := ol.timestamp + 1
        oe: out_req := out_req${t: t, t_stamp: ol.timestamp}
        reqs$addh(ol.request_queue, oe)
        if ol.in2? then ol.inout? := true end
        ticket$inc(ol.#reqs, 1)
    end out

    rd = proc(s: tuple, ol: cvt) returns(tuple)
        % Return a copy of a tuple matching s.
        re: rd_req := rd_req${s: s, t: tuple$nil()}
        reqs$addh(ol.request_queue, re)
        ticket$inc(ol.#reqs, 1)
        ticket$await_ge(ol.#ans, 1)
        ticket$dec(ol.#ans, 1)
        reqs$remh(ol.request_queue)
        return(re.t)
    end rd

Figure 4.10: Operations Log Cluster Part I
    rd_ans = proc(t: tuple, ol: cvt)
        % Deliver a rd answer t to ol; rd must be the top entry in ol.
        tagcase reqs$top(ol.request_queue)
            tag rd(re: rd_req):
                re.t := t
                ticket$dec(ol.#reqs, 1)
                ticket$inc(ol.#ans, 1)
            others:   % Not possible.
        end % tagcase
    end rd_ans

    in = proc(s: tuple, ol: cvt) returns(tuple)
        % Return a copy of a selected tuple matching s while all matching
        % tuples are locked, and in2 request is logged on ol.
        ie: in1_req := in1_req${s: s, t_set: tuple_set$nil(), all?: false, maj?: false}
        reqs$addh(ol.request_queue, ie)
        ticket$inc(ol.#reqs, 1)
        while true do
            ticket$await_ge(ol.#ans, 1)
            ticket$dec(ol.#ans, 1)
            if ie.all? then   % All replicas have locks.
                if ¬ tuple_set$empty?(ie.t_set) then
                    res: tuple := tuple_set$select(ie.t_set)   % Any one will do.
                    ol.timestamp := ol.timestamp + 1
                    i2e: in2_req := in2_req${s: s, t: res, t_stamp: ol.timestamp}
                    reqs$remh(ol.request_queue)
                    reqs$addh(ol.request_queue, i2e)
                    ol.in2? := true
                    ticket$inc(ol.#reqs, 1)
                    return(res)
                else   % i.e., if ie.all? = true & ie.t_set = {}, repeat in1.
                    ie.all? := false
                    ticket$inc(ol.#reqs, 1)
                end % if

Figure 4.11: Operations Log Cluster Part II
            elseif ¬ ie.maj? then   % No majority locks: unlock.
                ue: unlock_req := unlock_req${s: s}
                reqs$remh(ol.request_queue)
                reqs$addh(ol.request_queue, ue)
                ticket$inc(ol.#reqs, 1)
                ticket$await_ge(ol.#ans, 1)
                ticket$dec(ol.#ans, 1)
                reqs$remh(ol.request_queue)
                reqs$addh(ol.request_queue, ie)
                ticket$inc(ol.#reqs, 1)
            else   % i.e., ie.all? = false, ie.maj? = true, repeat in1.
                ie.maj? := false
                ticket$inc(ol.#reqs, 1)
            end % if
        end % while
    end in

    in1_ans = proc(lock_set: replica_set, cur_view: view, t_set: tuple_set, ol: cvt)
        % Inform ol that all the replies to the top entry (in1) are received.
        % lock_set is the set of replicas having locks.
        % t_set is the set of common tuples locked by all replicas.
        tagcase reqs$remh(ol.request_queue)
            tag in1(ie: in1_req):
                if |lock_set| = |cur_view| then
                    ie.all? := true
                    ie.t_set := t_set
                else ie.maj? := ismaj?(lock_set, cur_view)
                    % ismaj?(s1, s2) returns true if s1 is a majority of s2, and
                    % returns false otherwise.
                end % if
                ticket$dec(ol.#reqs, 1)
                ticket$inc(ol.#ans, 1)
            others:   % Not possible.
        end % tagcase
    end in1_ans

    unlock_ans = proc(ol: cvt)
        % The unlock entry is done.
        ticket$dec(ol.#reqs, 1)
        ticket$inc(ol.#ans, 1)
    end unlock_ans

Figure 4.12: Operations Log Cluster Part III
    finished = proc(k: int, ol: cvt)
        % Notify ol that the first k entries have been processed. Purge all
        % these entries from ol.request_queue, decrement #reqs, and reset the flags.
        for i: int in int$from_to(reqs$low(ol.request_queue),
                                  reqs$low(ol.request_queue) + k - 1) do
            reqs$reml(ol.request_queue)
            ticket$dec(ol.#reqs, 1)
        end % for
        ol.in2? := false   % reset flags since any in2 requests have now been removed
        ol.inout? := false
    end finished
end ops_log

Figure 4.13: Operations Log Cluster Part IV

The implementation of the operations log is shown in Figures 4.10-4.13. The basic
strategy is the following:

1. Requests are added by enqueuing them on the request queue, incrementing #reqs,
and setting in2? and inout? accordingly.

2. If an answer to a rd, an in1, or an unlock request is ready, the request on the
request queue is updated, the #reqs ticket is decremented, and the #ans ticket is
incremented. When the answer is picked up, the #ans ticket is decremented, and the
entry on the request queue is deleted.

3. Finished removes the specified number of (out and in2) entries from the bottom of
the request queue, decrements #reqs accordingly, and resets the in2? and inout?
flags. Resetting the flags is appropriate since if there was an in2 entry in the log, it
has now been removed.

4. Get_ops blocks the calling process until #reqs is greater than zero and then returns
a list of operations corresponding to the ready requests in the request queue.

The  tuple  space  operations  out,  rd,  and  in  are  processed  as  follows  (refer  to  Figures  4.5, 
4.6,  4.7,  and  4.10  -  4.13): 
• Out(t): FG forms an out request and enqueues the request on the request queue,
inout? is set to true if there is an in2 entry in the log (in2? = true), and #reqs is
incremented (the out operation can return at this point). This enables BG to receive
the operation request using get_ops. When the out is finished, BG calls finished to
remove the request from the request queue and to decrement #reqs.

• Rd(s): The rd operation places the request on the request queue, increments the
#reqs ticket, and waits until #ans becomes nonzero. When that happens, rd resets
#ans, picks up the result in the rd entry on the queue, deletes the entry, and assigns
the actuals in the result to the formals in s.

The answer to the rd entry is delivered by BG by calling rd_ans when one of the
replicas responds with a matching tuple. Rd_ans decrements #reqs, updates the rd
request with the matching tuple, and increments #ans.

• In(s): The in operation places an in1 request on the request queue and increments
#reqs. This causes BG to do the request and to return the answer by calling in1_ans,
which stores the information obtained by BG in the entry, decrements #reqs, and
increments #ans. Meanwhile in waits until #ans is nonzero. Then it resets #ans
and checks the information in the updated in1 entry. If all the replicas in the view
have set the locks and the intersection of the returned tuple sets is not empty, a
random tuple is selected from the intersection, the in1 entry is replaced by an in2
on the request queue, in2? is set, #reqs is incremented, the actuals of the selected
tuple are assigned to the formals of s, and in returns. If a majority, but not all, of the
replicas in the view have set locks, or if all have locks but the intersection is empty,
the in1 entry is left on the queue and #reqs is incremented to cause the request to
be repeated by BG. Otherwise, the in1 entry on the request queue is replaced by an
unlock entry, and #reqs is incremented; this causes BG to release the locks. After
the locks are released, the unlock entry is replaced by the in1 request so that the
in1 can be tried again.
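
The interplay between FG and BG for out and rd can be sketched in Python as follows. This is a deliberately reduced illustration, not a translation of Figures 4.10-4.13: it omits in, the in2? and inout? flags, and the readiness check, it represents log entries as small dictionaries, and the class and method names are hypothetical.

```python
import threading
from collections import deque


class Ticket:
    def __init__(self):
        self.value, self.cond = 0, threading.Condition()

    def add(self, n):                  # covers both inc (n > 0) and dec (n < 0)
        with self.cond:
            self.value += n
            self.cond.notify_all()

    def await_ge(self, n):
        with self.cond:
            self.cond.wait_for(lambda: self.value >= n)


class OpsLog:
    def __init__(self):
        self.queue = deque()           # oldest entry first
        self.timestamp = 0
        self.reqs, self.ans = Ticket(), Ticket()
        self.mutex = threading.Lock()  # keeps FG and BG from updating the queue together

    def out(self, t):                  # FG: log the request and return at once
        with self.mutex:
            self.timestamp += 1
            self.queue.append({"kind": "out", "t": t, "stamp": self.timestamp})
        self.reqs.add(1)

    def rd(self, s):                   # FG: block until BG delivers an answer
        entry = {"kind": "rd", "s": s, "t": None}
        with self.mutex:
            self.queue.append(entry)
        self.reqs.add(1)
        self.ans.await_ge(1)
        self.ans.add(-1)
        with self.mutex:
            self.queue.pop()           # the rd entry is always the last one
        return entry["t"]

    def get_ops(self):                 # BG: wait for work, then snapshot the queue
        self.reqs.await_ge(1)
        with self.mutex:
            return list(self.queue)

    def rd_ans(self, t):               # BG: deliver the rd result
        with self.mutex:
            self.queue[-1]["t"] = t
        self.reqs.add(-1)
        self.ans.add(1)

    def finished(self, k):             # BG: purge the first k entries
        with self.mutex:
            for _ in range(k):
                self.queue.popleft()
        self.reqs.add(-k)
```

In this sketch out never blocks the caller, while rd blocks only on the #ans ticket, which is the behavior the bullets above describe for FG.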
4.3 Processing On a Replica

The processing of a worker described above is coupled with the processing of a replica.
Replicas are responsible not only for executing the operations on tuple space copies, but
also for discarding duplicate messages. This section describes these activities.

4.3.1  Timestamp-Mid  Table 

Replicas may receive more than one message for the same operation, either because BG
sends a request more than once or because of duplication in the network. The unique
mids are used to recognize and discard duplicates generated by the network, but are not
sufficient to discard operation requests sent multiple times by a worker because a new mid
is used every time a message is sent. Some operations can be repeated without causing any
inconsistencies; others cannot. For instance, repeated out(t)’s will store multiple copies of
t when only one is appropriate; repeated in2’s may cause too many tuples to be deleted.
On the other hand, rd and unlock can be repeated without creating inconsistencies. We
call out and in2 unrepeatable operations. To avoid unrepeatable operations being executed
more than once at a replica, a timestamp is associated with each unrepeatable operation.
Each replica keeps a table of the last timestamp seen for each worker. These timestamps
indicate the workers’ high water marks: all the unrepeatable operations issued by a worker
with timestamps at or below the worker’s high water mark have already been executed, and
should not be executed again.

Information  about  mids  is  also  stored  in  the  table.  If  a  replica  has  seen  the  n-th  message 
from  a  worker,  then  any  message  before  the  n-th  is  obsolete  and  can  be  ignored.  Storing 
mids  is  not  necessary  for  the  correctness  of  the  protocol.  It  is  merely  an  optimization. 

The  timestamp  and  mid  information  about  all  the  workers  is  kept  by  a  replica  using  a 
table  called  the  timestamp-mid  table.  Figure  4.14  gives  the  specification  of  the  table.  A 
table  resides  at  each  replica.  It  records  the  timestamp  of  the  last  unrepeatable  operation 
the  replica  has  executed,  and  the  latest  mid  the  replica  has  seen  for  each  worker.  There  is 
at  most  one  entry  for  each  worker. 
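
The high water mark check can be illustrated with a small Python sketch. It assumes that the initial timestamp and mid are both 0 (the text does not state the initial values), and apply_out is a hypothetical helper standing in for the replica's handling of an out operation.

```python
class TimestampMidTable:
    """Per-worker record of the last timestamp and last mid seen by a replica."""

    def __init__(self):
        self._rows = {}                               # worker id -> [timestamp, mid]

    def _row(self, w):
        return self._rows.setdefault(w, [0, 0])       # initial values assumed to be 0

    def get_ts(self, w):
        return self._row(w)[0]

    def get_mid(self, w):
        return self._row(w)[1]

    def update_ts(self, w, ts):
        self._row(w)[0] = ts

    def update_mid(self, w, mid):
        self._row(w)[1] = mid


def apply_out(table, store, w, t, stamp):
    """Execute an out only if its timestamp is above w's high water mark."""
    if stamp > table.get_ts(w):
        table.update_ts(w, stamp)
        store.append(t)
        return True
    return False                                      # duplicate: already executed
```

A retransmitted out carries the same timestamp as the original, so the second call leaves the store unchanged even though its mid is new.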


table = abstract data type providing operations new, get_ts, get_mid,
        update_ts, update_mid

    % A table contains the last timestamp seen and last mid received
    % by a replica for each worker. There is at most one entry for each worker.
    % Tables are mutable.

    new = proc() returns(table)
        Return a new table containing no entries.

    get_ts = proc(tb: table, w: worker_id) returns(int)
        Return the timestamp of the last unrepeatable operation issued by w.
        If w is not already in tb, add an entry for w in tb with the
        initial timestamp and mid, and return the initial timestamp.

    get_mid = proc(tb: table, w: worker_id) returns(int)
        Return the most recent mid of w. If there is no entry for w in tb, create
        one with the initial timestamp and mid, and return the initial mid.

    update_ts = proc(tb: table, w: worker_id, ts: int)
        Update w's timestamp field with ts.

    update_mid = proc(tb: table, w: worker_id, mid: int)
        Update the mid field of w with mid.

end table

Figure 4.14: Specification for Timestamp-Mid Table
4.3.2 Tuple Space

Each  replica  keeps  a  copy  of  the  tuple  space.  In  our  protocol,  we  have  assumed  that  a 
tuple  space  is  implemented  as  a  set  of  tuple  sets.  The  tuples  with  the  same  logical  name 
are  grouped  into  the  same  set.  Each  set  has  a  lock.  When  a  set  is  locked  by  one  worker, 
no  other  worker  can  place  a  lock  or  delete  any  of  the  tuples  from  the  set  until  the  lock  is 
released.  Reading  of  a  locked  tuple  is  allowed,  however. 

The  specification  of  the  tuple  space  and  its  operations  is  given  in  Figure  4.15. 

Notice  that  there  can  be  only  one  lock  on  a  tuple  set  at  any  given  moment.  When  a 
tuple  set  is  locked,  the  tuples  in  the  set  can  be  deleted  only  by  the  worker  that  set  the  lock. 
A  locked  tuple  set  can  still  accept  tuples  stored  by  other  workers.  The  new  incoming  tuples 
are  automatically  locked  once  they  enter  a  locked  set. 
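
The locking rules can be made concrete with the Python sketch below. It is a simplified model under assumptions: the logical name of a tuple is taken to be its first field, a template uses None for a formal, and the helper matches is hypothetical; none of these details is fixed by the text.

```python
class Refused(Exception):
    """Raised when a tuple set is already locked by another worker."""


def matches(template, t):
    # A template field of None stands for a formal and matches anything.
    return len(template) == len(t) and all(a is None or a == b
                                           for a, b in zip(template, t))


class TupleSpace:
    def __init__(self):
        self._sets = {}     # logical name -> list of tuples
        self._locks = {}    # logical name -> worker holding the lock

    def store(self, t):
        self._sets.setdefault(t[0], []).append(t)   # allowed even if the set is locked

    def lock(self, s, w):
        owner = self._locks.get(s[0])
        if owner is not None and owner != w:
            raise Refused(s[0])
        self._locks[s[0]] = w
        return [t for t in self._sets.get(s[0], []) if matches(s, t)]

    def unlock(self, s, w):
        if self._locks.get(s[0]) == w:
            del self._locks[s[0]]

    def delete_unlock(self, t, s):
        self._sets[t[0]].remove(t)
        self._locks.pop(s[0], None)

    def search(self, s):
        for t in self._sets.get(s[0], []):           # reading ignores locks
            if matches(s, t):
                return t
        return None
```

Because the lock is kept per logical name rather than per tuple, a tuple stored into a locked set is covered by the existing lock without any extra bookkeeping.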

4.3.3  Replica  State 

The part of a replica’s state that affects the operations protocol is summarized in
Figure 4.16. Initially, the local view and its id are undefined on each replica. An execution of
the view change protocol is necessary to form a meaningful view. This will become clear in
the next chapter. The local tuple space copy and the timestamp-mid table are initialized
using tuple_space$new() and table$new(), respectively.

4.3.4  Executing  Operations 

When a replica is “active”, it calls the procedure execute_ops, shown in Figures 4.17
and 4.18, whenever it receives an operations list from a worker. The arguments needed are
the following: ops_list (the operations list), mid (the mid corresponding to the message
sent by the worker), w (worker’s id), and # (worker’s view id).

tuple_space = abstract data type providing operations new, delete_unlock, lock,
              unlock, search, store

    new = proc() returns(tuple_space)
        Return a new empty tuple space.

    store = proc(tspace: tuple_space, t: tuple)
        Store t in tspace.

    lock = proc(tspace: tuple_space, s: tuple, w: worker) returns(tuple_set) signals(refused)
        If the set containing tuples with s's logical name is not yet
        locked by a worker other than w, lock the set and return the set of
        matching tuple(s). (If there are no matching tuples, return an empty set.)
        If the tuple set has already been locked by a worker other than w, signal
        refused.

    unlock = proc(tspace: tuple_space, s: tuple, w: worker)
        Unlock the set that has the same logical name as s and is locked by w,
        if such a set exists. Otherwise, do nothing.

    delete_unlock = proc(tspace: tuple_space, t: tuple, s: tuple)
        Delete t from tspace and unlock the tuple set that matches s.

    search = proc(tspace: tuple_space, s: tuple) returns(tuple) signals(not_found)
        Search tspace for a tuple that matches s. If one is found, return it.
        Otherwise, signal not_found.

end tuple_space

Figure 4.15: Specification for Tuple Space

status: status             % replica is active or doing a view change
cur_view: view             % Initial value = undefined.
cur_viewid: viewid         % Initial value = undefined.
my_id: replica_id          % Replica's id.
t_space: tuple_space       % Initial value = tuple_space$new().
tbl: table                 % Initial value = table$new().

where

status = oneof[active, view_manager, underling: null]
viewid = <n: int, r: replica_id>
view = replica_set

Figure 4.16: Replica State (Partial)

execute_ops = proc(ops_list: ops, mid: int, w: worker_id, #: viewid)
    if mid <= table$get_mid(tbl, w) then return
    else table$update_mid(tbl, w, mid)
    end % if

    if # ≠ cur_viewid then
        send("newview", cur_viewid, cur_view) to w
        return
    end % if

    for operation: op in ops$elements(ops_list) do
        tagcase operation
            tag rd(e: tuple):
                found?: bool := true
                t: tuple := tuple$nil()
                t := tuple_space$search(t_space, e)
                    except when not_found: found? := false end % except
                send("rd_ans", mid, my_id, found?, t) to w
                return

            tag out(e: out_op):
                if e.t_stamp > table$get_ts(tbl, w) then
                    table$update_ts(tbl, w, e.t_stamp)
                    tuple_space$store(t_space, e.t)
                end % if

            tag in1(e: tuple):
                locked?: bool := true
                t_set: tuple_set := tuple_set$nil()
                t_set := tuple_space$lock(t_space, e, w)
                    except when refused: locked? := false end % except
                send("in1_ans", mid, my_id, locked?, t_set) to w
                return

            tag in2(e: in2_op):
                if e.t_stamp > table$get_ts(tbl, w) then
                    table$update_ts(tbl, w, e.t_stamp)
                    tuple_space$delete_unlock(t_space, e.t, e.s)
                end % if

Figure 4.17: Execute Operations Procedure I

            tag unlock(e: tuple):
                tuple_space$unlock(t_space, e, w)
                send("unlock_ans", mid, my_id) to w
                return
        end % tagcase
    end % for

    % If the last entry of ops_list is either an out or an in2 operation,
    % send a msg to w. The other three cases, rd, in1 and unlock, have
    % already had replies.
    tagcase ops$top(ops_list)
        tag out: send("out_ans", mid, my_id) to w
        tag in2: send("in2_ans", mid, my_id) to w
        others:   % ignore
    end % tagcase
end execute_ops

Figure 4.18: Execute Operations Procedure II

4.4 Summary

This chapter has described the operations protocol in detail. Its execution requires the
coupling of both the worker’s processing and part of the replica’s processing.

In addition to ensuring that the tuple space operations are executed on all the replicas in
cur_view eventually, the protocol guarantees that no undesirable effects, such as storing or
deleting too many tuples, result. To achieve this, the workers send their requests repeatedly
until they are satisfied with the returned results, and the replicas discard the operations
they have already executed. Timestamps and mids are used to detect duplicate operations
and messages.

Unnecessary delay of program processing is avoided by the introduction of a background
process (at each worker), which continuously processes the requests generated by the
program process. The program process is blocked only when it needs to know the result or to
ensure that the constraint that in2’s must be finished before out’s begin is obeyed.

The program (foreground) process and the background process at each worker
communicate with each other via a data structure called the operations log. The log synchronizes
the processes, logs the outstanding requests and results, and generates timestamps to
prevent duplicate processing of unrepeatable operations. Another attractive feature of the
operations log and the background process is that they provide a level of abstraction that
hides the tuple space replication from the program process.

The correctness and efficiency of the operations protocol depend largely on the
assumption that view changes are correctly taken care of by the view change algorithm. The next
chapter describes this algorithm.


Chapter  5 

View  Change  Algorithm 


The  operations  protocol  explained  above  is  a  read-one-write-all  scheme,  that  is,  rd  can 
return  a  result  from  any  replica  in  the  executing  worker’s  view,  but  out  and  in  operations 
are  completed  only  if  the  executing  worker  knows  that  their  effects  are  visible  at  every  replica 
in  its  view.  Thus,  every  replica  in  the  worker’s  view  knows  all  the  completed  operations 
that  change  the  tuple  space  state. 

Network  and  node  failures  cause  some  of  the  replicas  to  be  inaccessible  from  the  workers. 
If  we  let  the  workers  access  whichever  replica  they  can  access  at  the  moment,  an  inconsis¬ 
tency  may  result.  For  example,  suppose  a  network  failure  separates  replica  r  from  the  rest  of 
the  system.  While  r  is  inaccessible,  updates  are  made  to  other  replicas.  When  the  network 
is  repaired  and  r  becomes  accessible,  r’s  state  is  out  of  date,  and  must  be  brought  up  to 
date before being used again. The view change algorithm is used to mask problems like
this as well as to ensure that updates to the tuple space are not lost during failures.

The algorithm works roughly as follows: each replica possesses a view consisting of the
set of replicas it believes that it can communicate with. When a replica discovers that it
can no longer communicate with some replica, or communication is re-established with a
replica it could not hear from before, it starts a view change, and acts as the manager
of the view change. During the view change, the manager constructs a globally
unique new viewid, and sends a message to all other replicas, inviting them to join the new
view. The invited replicas can choose to accept the invitation. Those that have accepted
the invitation are called underlings. If a majority of replicas accept the invitation, a new
view is formed and an up-to-date tuple space state is chosen to be used to initialize the
tuple space state of all members of the new view. During a view change, the manager and
the underlings are blocked from workers’ operation requests.
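
The majority rule for forming a view can be sketched in Python as below. The details are assumptions made for illustration only: the new viewid is built by incrementing the counter of the highest viewid seen and pairing it with the manager's id, the manager counts itself toward the majority, and "majority" means strictly more than half of all replicas. The function names are hypothetical.

```python
from typing import Optional, Set, Tuple

ViewId = Tuple[int, str]   # (counter n, replica id r), compared lexicographically


def next_viewid(max_viewid: ViewId, my_id: str) -> ViewId:
    """Build a viewid larger than any seen so far (assumed construction)."""
    return (max_viewid[0] + 1, my_id)


def try_form_view(manager: str, accepted: Set[str],
                  all_replicas: Set[str]) -> Optional[Set[str]]:
    """Return the members of the new view, or None if there is no majority."""
    members = (accepted | {manager}) & all_replicas
    if 2 * len(members) > len(all_replicas):
        return members
    return None
```

With three replicas, a manager that hears from one underling can form a view of two, whereas a manager that hears from nobody cannot.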

In  the  next  section,  we  introduce  the  state  information  needed  in  order  for  a  replica  to 
provide  service  to  workers'  requests  and  run  the  view  change  algorithm.  The  mechanism 
to  test  accessibility  of  replicas  is  described  in  section  5.2.  Section  5.3  gives  an  overview  of 
the  view  change  algorithm.  Each  replica  is  in  one  of  three  states:  active,  view-manager  and 
underling.  Active  replicas  execute  workers’  requests,  monitor  the  topological  changes  in 
the  network,  and  monitor  view  change  invitations.  View  change  managers  coordinate  view 
changes  while  monitoring  invitations.  Replicas  in  the  underling  state  monitor  invitations 
as  well  as  participate  in  view  changes.  The  replica  activities  in  each  of  these  states  are 
detailed  in  sections  5.4,  5.5,  and  5.6,  respectively.  Section  5.7  gives  an  example  to  illustrate 
the  view  change  algorithm.  An  informal  correctness  argument  is  stated  in  section  5.8.  We 
will  make  certain  assumptions  about  crash  failures  during  the  discussion  of  the  algorithm, 
namely  that  the  replica  state  is  stable  and  survives  crashes.  The  full  discussion  of  crashes 
is  delayed  until  section  5.9,  in  which  we  will  discuss  a  number  of  possible  solutions  to  crash 
problems.  A  possible  optimization  is  also  discussed  in  section  5.9. 

5.1  Replica  State 

The view change algorithm requires some information to be recorded in the replica state.
This information is summarized in Figure 5.1 (an extension of Figure 4.16).

The current state of the replica is indicated by status, which is updated by the view
change algorithm. Each replica knows the current view, cur_view, of which it is a member.
Cur_view is identified by a unique viewid, cur_viewid. A replica also keeps a copy of the
highest viewid it has seen, max_viewid. It is always true that cur_viewid is less than or
equal to max_viewid. The set of all replicas in the system is represented by orig_config,
which stands for the original configuration. The state of the tuple space copy is in t_space,
and the timestamp-mid table described in the last chapter is kept in tbl.

When a replica is first created, status is view_manager; t_space has the value
tuple_space$new(); tbl is table$new(); my_id is assigned the replica’s id; cur_view and
cur_viewid are undefined; max_viewid has the initial value <0, my_id>; and orig_config
contains the ids of all the replicas in the system. One view change is necessary to let the
replicas have a common view and viewid to work with.

    status:       status        % replica is active or doing a view change
    t_space:      tuple_space   % tuple space copy
    tbl:          table         % timestamp-mid table
    my_id:        replica_id    % replica id
    cur_viewid:   viewid        % current viewid
    cur_view:     view          % current view
    max_viewid:   viewid        % highest viewid seen so far
    orig_config:  replica_set   % set of all replicas

    where

    status = oneof[active, view_manager, underling: null]
    viewid = <n: int, r: replica_id>
    view   = replica_set

Figure 5.1: Replica State (Complete)

We  assume  that  the  entire  replica  state  is  stored  on  stable  storage  [19];  we  discuss  this 
assumption  in  section  5.9. 
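
As an illustration only, the replica state of Figure 5.1 can be sketched in Python. The names ReplicaState, ViewId, Status, and new_replica are hypothetical, the tuple space and table are stand-in containers, and the sketch assumes that viewids are ordered first by sequence number and then by replica id, which is consistent with the comparisons used in the examples of section 5.7.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import FrozenSet, NamedTuple, Optional


class Status(Enum):
    ACTIVE = "active"
    VIEW_MANAGER = "view_manager"
    UNDERLING = "underling"


class ViewId(NamedTuple):
    n: int   # sequence number
    r: str   # id of the replica that created the viewid
    # Tuples compare field by field, so (n, r) pairs are totally ordered
    # (assumed ordering: sequence number first, then replica id).


@dataclass
class ReplicaState:
    my_id: str
    orig_config: FrozenSet[str]
    status: Status = Status.VIEW_MANAGER          # a new replica starts as a manager
    t_space: list = field(default_factory=list)   # stand-in for the tuple space copy
    tbl: dict = field(default_factory=dict)       # stand-in for the timestamp-mid table
    cur_viewid: Optional[ViewId] = None           # undefined until the first view change
    cur_view: Optional[FrozenSet[str]] = None     # undefined until the first view change
    max_viewid: Optional[ViewId] = None

    def __post_init__(self):
        if self.max_viewid is None:
            self.max_viewid = ViewId(0, self.my_id)   # initial value <0, my_id>


def new_replica(my_id, all_ids):
    """Create the state of a freshly started replica."""
    return ReplicaState(my_id=my_id, orig_config=frozenset(all_ids))
```

Keeping the viewid as an ordered pair makes the invariant cur_viewid <= max_viewid directly checkable once a first view has been formed.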

5.2  Probes 

The  topological  changes  in  the  network  are  detected  by  sending  and  receiving  probes.  This 
is  accomplished  using  two  processes  at  each  replica,  one  that  sends  probes  and  the  other 
that  receives  them. 

The probing procedure is shown in Figure 5.2, and works as follows. Probes are sent
out to all other replicas in the system periodically, one every probe_interval. Every time a
probe is sent, the probing process waits to collect the replies. It adds the replying replica’s
id to a temporary set reply_set if the reply is to the current probe. After a time interval
δ2, long enough for a round-trip probe in the normal situation, the process checks to see if
reply_set contains the same replicas as its current view. Any discrepancy, while the replica
is in the active state, indicates that there may be a change in the network’s configuration,
so the probing process sends a change message to another process (to be discussed later) of
the same replica, which triggers a view change.

    send_probes = proc()
        probe_interval: int :=     % fill in the appropriate probe period
        probe_seq: int := 0
        while true do
            if is_active(status) then
                for rr: replica_id in replica_set$elements(orig_config - {my_id}) do
                    send("probe", my_id, probe_seq) to rr
                    end % for
                reply_set: replica_set := {my_id}
                t1: int := current_time + δ2
                while true do
                    receive until t1
                        probe_resp(r: replica_id, m: int):
                            if m = probe_seq then reply_set := reply_set U {r} end % if
                        end % receive
                        except when timeout: break end % except
                    end % while
                if is_active(status) cand (reply_set ~= cur_view) then
                    send change(cur_viewid) to my_id
                    end % if
                end % if
            probe_seq := probe_seq + 1
            sleep(probe_interval)
            end % while
        end send_probes

Figure 5.2: Send Probe

    monitor_probes = proc()
        while true do
            receive
                probe(r: replica_id, m: int):
                    send("probe_resp", my_id, m) to r
                end % receive
            end % while
        end monitor_probes

Figure 5.3: Monitor Probe

Notice that we said there may be a reconfiguration instead of there is a reconfiguration.
This is because lost or delayed messages may cause reply_set to be inconsistent with the
replica’s current view, and occasional message loss or delay does not always mean there is
a topological change. Also notice that probes are sent only when a replica is in the “active”
state.

The probes are monitored by the monitoring processes running monitor_probes (shown
in Figure 5.3) at each replica. To ensure that replies correspond to the current probe, a
sequence  number  is  piggybacked  on  the  probing  message,  and  returned  on  the  reply.  This 
allows  the  probing  process  to  consider  only  current  replies. 
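
The check made at the end of one probe round can be sketched in Python as follows. This is a simplified illustration, not the thesis code: the function name probe_round and the representation of replies as (replica id, sequence number) pairs are assumptions, and message transport and timing are left out.

```python
def probe_round(my_id, cur_view, probe_seq, responses):
    """Return True if this probe round suggests the view may have changed.

    responses: iterable of (replica_id, seq) pairs received before the timeout.
    """
    reply_set = {my_id}                 # a replica can always reach itself
    for replica_id, seq in responses:
        if seq == probe_seq:            # ignore replies to earlier probes
            reply_set.add(replica_id)
    # Any difference from the current view is only a hint of a reconfiguration.
    return reply_set != set(cur_view)
```

A True result corresponds to sending the change message; a late reply with an old sequence number is treated exactly like a missing reply.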

5.3  Overview  of  the  View  Change  Algorithm 

As  we  said  earlier,  the  probes  provide  a  means  of  detecting  possible  network  reconfigurations. 
Once  a  replica  believes  that  it  can  no  longer  communicate  with  the  same  set  of  replicas  it 
could  previously,  a  change  message  is  sent  by  the  probing  process.  This  message  is  received 
by  the  third  process  (on  the  same  replica)  which  in  turn  initiates  a  view  change.  The  replica 
switches  from  being  active  to  being  the  manager  of  the  view  change. 

The view change algorithm operates in one and a half phases. In the first phase, the
manager constructs a new globally unique viewid, invites all replicas in the system to join
the new view, and waits for responses. A replica accepts the invitation only if it has not
already received another invitation to join a higher-numbered view; each acceptance message
contains the replica’s current viewid and a copy of its tuple space. We assume that a crashed
replica  recovers  with  its  old  state  restored.  This  assumption  guarantees  that  a  replica  either 
does  not  reply  to  an  invitation,  or  replies  with  the  tuple  space  state  that  corresponds  to  its 
current  viewid.  In  section  5.9,  we  will  discuss  mechanisms  to  support  this  assumption. 

The manager keeps a temporary copy of the tuple space and the timestamp-mid table,
initialized from its own copies at the beginning of the view change.
Each incoming acceptance is checked, and if it carries a more up-to-date tuple space and
table (as indicated by the accompanying viewid), they replace the temporary copy. So the
temporary copy of the tuple space is always the most up-to-date copy the manager has
seen.

If fewer than a sub-majority¹ of replicas accept the invitation, no new view can be formed.
The replicas will repeatedly attempt to form another view until a view change succeeds.
Otherwise, the view change enters the last half phase, during which the manager sends a
commit message to all the replicas that have agreed to join the view. The temporary copies
of the tuple space and the timestamp-mid table are piggybacked on the commit message, and
are used to update the state of all the replicas in the new view. The view manager becomes
active once its local state is updated and the commit message is sent. The participating
replicas become active when they receive the commit message and their local states are
updated.

The  algorithm  is  implemented  as  the  third  process  of  a  replica  (the  first  two  being 
sending  and  monitoring  probes).  We  call  this  process  the  main  process.  The  main  process 
is also responsible for executing workers’ tuple space operation requests. Figure 5.4 shows
the state diagram of the view change algorithm.

In the “active” state, the replica sends and monitors probe messages, monitors view
change invitations, and executes the operation requests from the workers. If probing triggers
a view change, the replica moves to the “view_manager” state. If it receives an invitation
to join a view with a higher viewid than the maximum it has seen, it changes to the
“underling” state and participates in a view change.

¹ A sub-majority is one less than a majority.

Figure 5.4: State Diagram for the View Change Algorithm

    while true do
        tagcase status
            tag active: active()
            tag view_manager: view_manager()
            tag underling: underling()
            end % tagcase
        end % while

Figure 5.5: The View Change Algorithm

In the “view_manager” state, the replica coordinates a view change and also monitors
view change invitations. When the view change is done, it resumes execution in the “active”
state. If, during the view change, the replica receives an invitation to join a view with a
higher viewid than any it has seen, it becomes an “underling.”

When a replica is an “underling,” it is a participant in a view change. When it receives
a commit message, it commits itself to the new view and enters the “active” state. If it does
not receive a commit message within a reasonable time, it becomes a view manager
and starts a view change. If it receives an invitation to join a higher-numbered view, it
accepts the invitation and remains in the “underling” state.

Figure  5.5  shows  the  program  of  the  above  state  diagram.  It  is  structured  as  an  infinite 
loop.  The  replica  determines  its  current  state  and  calls  the  procedure  that  is  executed  while 
it  is  in  that  state.  The  next  three  sections  discuss  these  procedures. 
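
The transitions described above can be summarized as a small lookup table. The Python sketch below is only an illustration; the event names (change, invite_higher, view_formed, no_majority, commit, timeout) are hypothetical labels for the conditions in the text, not message names from the thesis.

```python
# (current status, event) -> next status, following the state diagram description.
TRANSITIONS = {
    ("active", "change"): "view_manager",          # probing suspects a reconfiguration
    ("active", "invite_higher"): "underling",      # invited to a higher-numbered view
    ("view_manager", "view_formed"): "active",     # majority accepted, commit sent
    ("view_manager", "no_majority"): "view_manager",   # try again
    ("view_manager", "invite_higher"): "underling",
    ("underling", "commit"): "active",
    ("underling", "timeout"): "view_manager",      # no commit arrived in time
    ("underling", "invite_higher"): "underling",   # join the newer view change instead
}


def next_status(status, event):
    """Return the status after an event; unlisted events leave it unchanged."""
    return TRANSITIONS.get((status, event), status)
```

Events that are not listed, such as an outdated invitation or a commit for a view the replica did not agree to join, are ignored and leave the status as it was.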

5.4  Active  Replicas 

Figure 5.6 shows the procedure for the active state. The main process in the “active”
state receives three types of messages: change, invite, and ops. Change messages are sent
by the probing process on the same replica when it suspects changes in communication
capability. Invite messages are sent by other replicas when they start view changes. Ops
messages are operation requests sent by the workers. The execute_ops procedure called
upon receiving an ops message was illustrated in Figures 4.17 and 4.18 of the last chapter.

    active = proc()
        receive
            change(vid: viewid):
                if vid < cur_viewid then return end % if view is already changed
                status := view_manager
            invite(vid: viewid, r: replica_id):
                if vid <= cur_viewid then return end % if an out-dated invitation
                max_viewid := vid
                send("accept", my_id, vid, cur_viewid, t_space, tbl) to r
                status := underling
            ops(ops: ops_type, mid: int, w: worker_id, vid: viewid):
                execute_ops(ops, mid, w, vid)
            end % receive
        end active

Figure 5.6: Active

It is worth pointing out a possible race here. Suppose that the probing process
of a replica r1 sends a change message to the main process at the same time as r1 receives an
invitation to join a new higher-numbered view from replica r2. The main process of r1 can
nondeterministically select either message to receive first. If the change message is selected
first, r1 enters the “view_manager” state and competes with r2 to change the view (we
discuss concurrent view changes in a later section). If r1’s view change succeeds, its
cur_viewid will be updated to reflect the new view. When the invite message is processed,
the vid in the message is likely to be less than or equal to cur_viewid, and the message is
thus ignored. On the other hand, if the invite message is received first, r1 participates in
r2’s view change. When the view change is completed and r1 becomes “active” again, its
cur_viewid is updated. This causes the change message to be ignored when it is received,
since vid in the change message is the old cur_viewid on r1, which must be lower than the
new cur_viewid.
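
The two guards that resolve this race can be sketched in Python. The function active_step is a hypothetical simplification of Figure 5.6 that returns only the next status; viewids are modeled as (sequence number, replica id) tuples, which is an assumption about their ordering.

```python
def active_step(cur_viewid, msg):
    """Return the next status of an active replica for one incoming message.

    msg is a (kind, vid) pair, where kind is "change", "invite", or "ops".
    """
    kind, vid = msg
    if kind == "change":
        # A change message carrying an older viewid refers to a view that has
        # already been replaced, so it is dropped.
        return "active" if vid < cur_viewid else "view_manager"
    if kind == "invite":
        # An invitation that is not strictly higher is out of date.
        return "active" if vid <= cur_viewid else "underling"
    return "active"   # ops messages do not change the status
```

If the invitation is handled first and the resulting view change commits, the replica's cur_viewid moves forward, so the pending change message (which still carries the old viewid) fails the first guard and is ignored.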

5.5  View  Managers 

Figure 5.7 shows the procedure run by the view managers. The local variable t_ts is the
temporary copy of the tuple space. It records the most recent copy of the tuple space the
manager has seen. T_ts is initialized to the view manager’s local tuple space. T_vid keeps
a copy of the viewid corresponding to t_ts. T_tbl keeps the most up-to-date timestamp-mid
table. N_view is a temporary replica set containing the ids of the replicas that have
accepted the view change invitation. The globally unique viewid is created by pairing the
successor of the largest sequence number in any viewid seen so far with the manager’s
replica id.

    view_manager = proc()
        t_ts: tuple_space := t_space
        t_vid: viewid := cur_viewid
        t_tbl: table := tbl
        n_view: replica_set := {my_id}
        max_viewid := <max_viewid.n + 1, my_id>
        for rr: replica_id in replica_set$elements(orig_config - {my_id}) do
            send("invite", max_viewid, my_id) to rr
            end % for
        t2: int := current_time + δ2
        while true do
            receive until t2
                accept(r: replica_id, vid, rtn_viewid: viewid, ts: tuple_space, ta: table):
                    if vid = max_viewid then
                        n_view := n_view U {r}
                        if t_vid < rtn_viewid then
                            t_vid := rtn_viewid
                            t_ts := ts
                            t_tbl := ta
                            end % if
                        if |n_view| = |orig_config| then break end % if
                        end % if
                invite(vid: viewid, r: replica_id):
                    if vid <= max_viewid then continue end % if
                    max_viewid := vid
                    send("accept", my_id, vid, cur_viewid, t_space, tbl) to r
                    status := underling
                    return
                end % receive
                except when timeout: break end % except
            end % while
        if ~ismaj?(n_view, orig_config) then return end % if
        cur_view := n_view
        cur_viewid := max_viewid
        t_space := t_ts
        tbl := t_tbl
        for rr: replica_id in replica_set$elements(n_view - {my_id}) do
            send("commit", cur_viewid, cur_view, t_space, tbl) to rr
            end % for
        status := active
        end view_manager

Figure 5.7: View Manager

To manage a view change, the manager first sends the invitation to all the replicas in
orig_config excluding itself and then waits for responses. Two types of messages may be
received: accept messages (sent by the accepting replicas) and invite
messages (sent by replicas that have started new view changes).

When an accept message is received, the manager checks whether the acceptance is for the
invitation just sent. If not, the message is ignored. Otherwise, n_view is updated to include
the id of the accepting replica, and t_ts, t_vid, and t_tbl are updated if necessary.

When an invite message is received, the invitation is accepted only if the view the
replica is invited to join has a higher viewid than any it has seen so far. By accepting the
invitation, the manager abandons the view change in progress and changes its state
to “underling” to participate in the new view change.

The receiving loop can be exited in two ways: either all the replicas in orig_config have
accepted the invitation, or the timeout δ2 expires. δ2 is chosen to be long enough for a
normal round trip of invitation and acceptance messages.

In order to form a new view, a majority of replicas must accept the invitation.
(Recall that the function ismaj?(s1, s2) checks whether replica set s1 contains a majority
of the members of s2.) If this is not true, the current view change is abandoned, and the
manager will attempt to form another new view. Otherwise, the manager’s current state
(cur_view, cur_viewid, t_space, and tbl) is updated, and a commit message is sent to all the
accepting replicas along with the new view, viewid, tuple space, and timestamp-mid table
copy. Upon completion, the manager enters the “active” state.
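
The manager's decision logic can be sketched in Python, leaving out message transport and timeouts. The names is_majority and run_view_change are hypothetical, the majority test is assumed to mean strictly more than half of the configuration, and viewids are modeled as (sequence number, replica id) tuples.

```python
def is_majority(members, config):
    """Assumed meaning of ismaj?: strictly more than half of config."""
    return 2 * len(members) > len(config)


def run_view_change(my_id, my_viewid, my_state, new_viewid, config, accepts):
    """Decide the outcome of one view change attempt.

    accepts: list of (replica_id, vid, rtn_viewid, state) tuples received
    before the timeout. Returns (view, state) if a view forms, else None.
    """
    n_view = {my_id}                      # the manager is always a member
    best_viewid, best_state = my_viewid, my_state
    for replica_id, vid, rtn_viewid, state in accepts:
        if vid != new_viewid:             # acceptance of some other invitation
            continue
        n_view.add(replica_id)
        if best_viewid < rtn_viewid:      # keep the most up-to-date state seen
            best_viewid, best_state = rtn_viewid, state
    if not is_majority(n_view, config):
        return None                       # abandon; the manager will try again
    return frozenset(n_view), best_state
```

Because the manager counts itself in n_view, it needs acceptances from only a sub-majority of the other replicas for the view to form.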




    underling = proc()
        t3: int := current_time + δ3
        while true do
            receive until t3
                commit(vid: viewid, n_view: view, tsp: tuple_space, ta: table):
                    if vid = max_viewid then
                        cur_view := n_view
                        cur_viewid := max_viewid
                        t_space := tsp
                        tbl := ta
                        break
                        end % if
                invite(vid: viewid, r: replica_id):
                    if vid <= max_viewid then continue end % if
                    max_viewid := vid
                    send("accept", my_id, vid, cur_viewid, t_space, tbl) to r
                    return
                end % receive
                except when timeout:
                    status := view_manager
                    return
                    end % except
            end % while
        status := active
        end underling

Figure 5.8: Underling

5.6  Underlings 

Figure  5.8  shows  the  code  executed  by  the  main  process  in  the  “underling”  state.  A 
replica  becomes  an  “underling”  if  it  accepts  an  invitation  to  join  a  new  view.  While  it  is  in 
the  “underling”  state,  it  expects  to  receive  a  commit  message  from  the  manager.  It  is  also 
possible  to  receive  an  invitation  to  join  a  new  view. 

The time interval δ3 in the receive statement is chosen to be long enough
to allow the acceptance message to go from the underling to the manager and the commit
message to go from the manager to the underling in the normal situation. If the timeout
expires, the underling starts a new view change by switching to the “view_manager” state.

When a commit message is received, the underling checks if the commit request is for the
view it has agreed to join. If not, the commit message is ignored. Otherwise, the underling
uses the information piggybacked on the commit message to update its local state and
switches to the “active” state.

If  an  invitation  for  a  higher  viewid  is  received,  the  underling  accepts  the  invitation, 
ceases  its  involvement  in  the  current  view  change,  and  stays  in  the  “underling”  state  to 
wait  for  the  new  commit  message.  Otherwise,  the  invitation  is  ignored. 
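
The underling's reaction to each kind of event can be sketched in Python as a single step function. The name underling_step, the dictionary-based state, and the message tuples are all hypothetical simplifications; the reply omits the tuple space and table that a real acceptance would carry.

```python
def underling_step(state, msg):
    """Handle one event while in the underling state.

    state: dict with keys status, max_viewid, cur_viewid, cur_view, t_space, tbl.
    msg: ("commit", vid, view, t_space, tbl), ("invite", vid, sender), or ("timeout",).
    Returns the reply to send (a tuple) or None.
    """
    kind = msg[0]
    if kind == "commit":
        _, vid, view, t_space, tbl = msg
        if vid == state["max_viewid"]:        # commit for the view we agreed to join
            state.update(cur_view=view, cur_viewid=vid,
                         t_space=t_space, tbl=tbl, status="active")
        return None                           # any other commit is ignored
    if kind == "invite":
        _, vid, sender = msg
        if vid <= state["max_viewid"]:        # not higher than anything seen: ignore
            return None
        state["max_viewid"] = vid             # switch to the newer view change
        return ("accept", sender, vid, state["cur_viewid"])
    if kind == "timeout":
        state["status"] = "view_manager"      # start a view change of our own
    return None
```

Accepting a higher invitation updates max_viewid, so a commit that arrives later for the abandoned view no longer matches and is dropped.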

5.7  Examples 

This  section  gives  an  example  to  illustrate  that  the  view  change  algorithm  is  robust  in  both 
the  simple  case  where  there  is  only  one  view  manager  coordinating  a  view  change  and  the 
case  when  multiple  view  managers  compete  to  form  new  views. 

5.7.1  Simple  Case 

Let us suppose that we have five replicas in the original configuration, r1, r2, r3, r4, and
r5. At some point, the view contains all five replicas, all in the “active” state. Then a
failure occurs, which makes r1 inaccessible from the other replicas. We assume for simplicity
that following the initial failure, no additional failures occur during the view change; once
r1 becomes inaccessible, it remains inaccessible for the duration of the algorithm.

At the point of failure, all five replicas have the same viewid v1, <1, r1>, identifying
view {r1, r2, r3, r4, r5}. When r1 becomes inaccessible, the other replicas stop hearing
from it. We suppose that r3 detects this change and starts the view change. (More than
one replica may detect this change and trigger the algorithm; this is the topic of the next
subsection.) R3 becomes the view manager and enters the first phase of the algorithm.
It computes a new viewid <2, r3>, which is higher than anything r3 has seen. Next, it
sends the invitation message containing the new viewid to the other replicas in the original
configuration and waits for responses.

Each of r2, r4, and r5 receives the invitation message and sends back an acceptance
message containing, among other things, its current viewid, a copy of its local tuple space,
and the timestamp-mid table. No reply is forthcoming from r1 since it is inaccessible. R3
collects the responses and keeps the most up-to-date state it has seen. (In this case, the
tuple spaces of r2, r3, r4, and r5 are all equally up to date, so r3’s tuple space will be used.)

In the last half phase, r3 forms a new view containing r2, r3, r4, and r5. This is possible
because the new view has a majority of replicas. After updating its own local state, r3
sends a commit message containing the new viewid (<2, r3>), the new view ({r2, r3, r4,
r5}), and an up-to-date copy of the tuple space and the table to r2, r4, and r5, and becomes
“active” to accept operation requests, send and receive probes, and monitor new view
changes. When r2, r4, and r5 receive the commit message, they update their local state
and switch to the “active” status.

In the meantime, r1 is also running the algorithm and is trying
to form a view. As the view manager, it computes a new viewid and sends invitation
messages to the other replicas. No responses are forthcoming due to the communication failure.
It waits in vain for acceptances and eventually times out, remaining in the “view_manager”
state.

In  this  scenario,  the  algorithm  forms  a  new  view  excluding  inaccessible  replicas.  The 
algorithm  works  similarly  in  the  case  of  including  replicas  that  become  accessible  when  a 
failure  is  repaired. 

5.7.2  Concurrent  View  Managers 

If, in the above scenario, more than one replica detects a change in the communication
capability, several replicas may become view managers simultaneously. Our view management
algorithm handles this case of multiple concurrent view managers in the following way.

The viewids generated by different replicas are distinct, since we include the replica id
as part of the viewid. In the previous example, let us imagine that the ids of r2 through r5
are in increasing order. Suppose replicas r2 and r3 start up as view managers. R2 computes
<2, r2> and r3 computes <2, r3>. Both send invitation messages to everybody else in
the configuration. The following events happen:

1. R2 receives an invitation from r3. Since <2, r3> > <2, r2>, r2 accepts the invitation
and stops acting as a view manager.

2. R3 receives an invitation from r2. Since <2, r2> < <2, r3>, r3 knows of a higher
viewid, so it ignores the invitation from r2.

3. R4 and r5 receive invitation messages from both r2 and r3. If they receive the
invitation from r2 first, they will accept the invitation and wait for the commit message
from r2. When they receive the invitation from r3, they will stop participating in the
previous view change and start participating in the view change initiated by r3. On
the other hand, if r4 and r5 receive the invitation from r3 first, the later invitation,
the one from r2, will be ignored because it has a lower viewid.

Thus, no matter in what order the messages arrive, the outcome is the same: r3’s new
viewid is the one that prevails because it is higher. This conclusion can be
generalized to any number of concurrent managers.
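
The order independence claimed here can be checked with a few lines of Python. This is a sketch under the assumption that viewids compare as (sequence number, replica id) pairs; final_choice is a hypothetical helper that models only the "accept strictly higher viewids" rule at r4 or r5.

```python
from itertools import permutations


def final_choice(invitations, start=(1, "r1")):
    """Viewid a replica ends up having accepted, given invitations in arrival order."""
    max_viewid = start                  # viewid in force before the failure
    for vid in invitations:
        if vid > max_viewid:            # accept only strictly higher viewids
            max_viewid = vid
    return max_viewid


invites = [(2, "r2"), (2, "r3")]
# Try every arrival order and collect the distinct outcomes.
outcomes = {final_choice(order) for order in permutations(invites)}
```

Since the rule simply keeps the maximum viewid seen, outcomes contains a single element, <2, r3>, whichever invitation arrives first, and the same argument applies to any number of competing managers.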

5.8  Correctness 

We  claim  that  (1)  the  effects  of  the  tuple  space  operations  either  survive  into  the  new  view 
(if  the  operations  are  completed  at  all  the  replicas  in  the  old  view)  or  will  be  retried  in  the 
new  view  (if  the  operations  are  not  completed  at  all  the  replicas  in  the  old  view),  and  (2) 
the  unrepeatable  operations  are  executed  at  most  once  across  the  view  changes. 

The intuition behind the first claim is that every view contains a majority of the replicas.
Because any two majorities have at least one replica in common, a new view contains at
least one replica that knows about the effects of all operations that
completed in earlier views. That replica is used to update the state of the replicas in the
new view. If an operation is not completed at all the replicas in a view, the executing
worker keeps retrying until all the replicas in the current view have acknowledged
its completion (this was explained in the last chapter). The retry is needed because if a
view change takes place before an operation is completed at all the replicas in the old view,
the new view may or may not contain any replica that is aware of the operation.

Repeated attempts to complete operations do not imply that the operations are executed
more than once. Duplicate requests for the same operation are filtered out using the
timestamp-mid table at each replica, as described in the last chapter. Furthermore, the
timestamp-mid table is accurate since it is taken from the replica whose tuple space is used
to initialize the state of the new view. This satisfies our second claim.

We are also interested in whether the algorithm makes progress, that is, whether it
succeeds in forming new views as long as a sufficient number of replicas can communicate.
Of course, it can only make progress provided that failures happen rarely, but this is a
reasonable  assumption.  To  increase  the  probability  of  a  view  change,  the  algorithm  needs 
to  be  tolerant  of  slow  responses  and  lost  messages.  For  example,  suppose  a  manager  waits 
only  until  it  hears  from  enough  replicas  to  form  a  view  even  though  there  are  other  replicas 
that  could  respond.  This  would  result  in  those  other  replicas  being  excluded  from  the  new 
view,  which  in  turn  means  another  view  change  will  occur  shortly.  If  that  next  view  change 
also  excludes  some  potential  members,  that  will  lead  to  another  view  change,  and  so  on. 

To avoid such a situation, a manager should use a fairly long timeout while it waits to
hear from all replicas that the probes (the “I’m alive” messages) indicate should reply.
Similarly, an underling should use a fairly long timeout before it becomes a manager. In
addition, it is worthwhile to mask lost messages by sending duplicates, so that a lost
message will not trigger another view change.

5.9  Discussion 

In  concluding  this  chapter,  we  discuss  a  number  of  approaches  to  handling  crashes  and  a 
possible  optimization. 

5.9.1  Crashes 

In  the  above  discussion,  we  assumed  that  after  a  node  crash,  a  replica  recovers  with  all  its 
pre-crash  state  restored.  That  is,  no  information  is  lost  during  crashes.  This  subsection 
discusses  two  extant  implementations,  and  gives  a  reference  to  a  method  that  can  be  used 
when  this  assumption  does  not  hold. 

An easy solution is to provide stable storage [19] at each replica. Each replica has
some form of nonvolatile storage (for example, disks). The updates to the replica state are
recorded in a log in the order they occur. The log is kept in stable storage. During
recovery from a crash, the contents of the log are replayed in the order they were stored
to restore the pre-crash replica state.
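
Log replay can be illustrated with a short Python sketch. The record format is an assumption made for this example only: each log entry is an (operation, tuple) pair, where "out" adds a tuple to the tuple space and "in" removes one. The thesis does not specify the log format, and the timestamp-mid table and view information are omitted.

```python
def replay(log):
    """Rebuild a tuple space by replaying logged updates in their original order."""
    t_space = []
    for op, tup in log:
        if op == "out":            # out deposits a tuple
            t_space.append(tup)
        elif op == "in":           # in withdraws a matching tuple
            t_space.remove(tup)
    return t_space


log = [("out", ("job", 1)), ("out", ("job", 2)), ("in", ("job", 1))]
recovered = replay(log)
```

Replaying in the recorded order matters: an "in" entry is only meaningful after the "out" entry that deposited the tuple it removes.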

Although  simple,  this  approach  is  usually  undesirable.  In  order  to  make  the  storage 
truly  stable,  duplicate  copies  are  needed.  This  makes  the  writing  unacceptably  slow.  For 
example,  if  stable  storage  is  implemented  using  two  disks,  both  disks  need  to  be  written. 
Each  update  needs  to  be  done  sequentially:  first  it  is  written  to  one  disk  and  then  that  disk 
must  be  read  to  ensure  that  the  write  happened  successfully;  then  the  same  process  must 
be  repeated  on  the  other  disk. 

An alternative to the above approach is to supply each replica with a disk and an
uninterruptible power supply (UPS). Because of the UPSs, replicas can acknowledge operations
as soon as the information resides in main memory. If the replica’s node crashes, the UPS
will permit it to write volatile memory to disk before it shuts down.

Oki  has  discussed  a  replication  scheme  that  uses  only  a  little  nonvolatile  or  stable  storage. 
Interested  readers  can  refer  to  [22][23]  for  a  discussion  of  this  scheme. 

5.9.2  Optimization 

Our algorithm uses one and a half phases. During the first phase, the manager sends out
invitations to all other replicas, and the underlings respond to the invitations. The
underlings’ current viewids and their tuple space copies are piggybacked on the responses.
During the last half phase, the manager tries to form a view and, if one can be formed,
sends a commit message along with the most up-to-date tuple space copy to all the
underlings. No responses to the commit messages are necessary.

This scheme is simple, but costly in terms of the amount of information being transmitted
and the amount of storage required if the tuple space is large. This is because the entire
tuple space and table are sent on every underling’s acceptance message to an invitation,
and the manager has to keep a temporary copy of the tuple space and table in addition to
its local copies. An alternative to the one-and-a-half-phase scheme is a two-phase scheme.
During the first phase, the manager sends the invitation to all other replicas, and the
underlings respond by sending back their local viewids only. In the second phase, the manager
informs the replica with the highest viewid to distribute its local copy of the tuple space
and table to all the replicas in the new view. If the manager itself has the most recent
viewid, this becomes a one-and-a-half-phase scheme.
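
The manager's decision in the two-phase variant can be sketched as below. The function name form_view, the use of plain integers as viewids, and the tie-break in favor of the manager are assumptions made for illustration; the actual viewid format is the one defined in Chapter 5.

```python
def form_view(manager, manager_viewid, responses, total_replicas):
    """Decide whether a view can be formed and which replica supplies the state.

    responses maps each accepting underling to the viewid it reported.
    Returns (members, source), or None if no majority of replicas answered.
    """
    viewids = dict(responses)
    viewids[manager] = manager_viewid
    if 2 * len(viewids) <= total_replicas:
        return None  # fewer than a majority: no view can be formed
    # Highest viewid wins; on a tie prefer the manager, which avoids the
    # extra transfer and reduces to the one-and-a-half-phase scheme.
    source = max(viewids, key=lambda r: (viewids[r], r == manager))
    return sorted(viewids), source
```

Only viewids travel in the first phase here; the tuple space and table are sent once, by the selected source, in the second phase.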


Chapter  6 

Discussion 


In the previous chapters, we described a technique for constructing a highly-available tuple
space  that  works  in  a  general  communications  network.  The  method  involves  little  delay 
of  workers:  a  rd  waits  for  only  one  response,  an  out  does  not  delay  the  worker  at  all, 
and  an  in  delays  the  worker  only  during  the  first  phase.  The  protocol  was  simulated  using 
Argus  [20]  on  several  VAXstations  connected  by  a  local  area  network.  Deliberate  failures 
were  generated  to  simulate  the  possible  failures  in  a  general  communications  network.  The 
simulation  survived  the  various  failures  we  were  able  to  construct. 

Our  work  contributes  in  the  following  two  areas: 

•  The protocol makes it possible to implement a highly-available tuple space on a
communications network where nodes may crash and recover, and the network may crash,
partition, and be repaired. This establishes the foundation for building Linda
systems on a communications network. Our research indicates how fault-tolerance might
be  achieved  for  other  parallel  systems.  Many  parallel  computations  are  long  lived; 
fault-tolerance  is  particularly  important  for  them.  In  addition,  the  other  advantages 
of  distribution  (using  inexpensive  machines  over  a  network  and  scalability)  apply  to 
any  parallel  system. 

•  The  protocol  described  in  this  thesis  is  an  addition  to  the  general  replication  schemes 
that provide fault-tolerant and highly-available services in distributed systems. It
shows what can be done when the semantics of the operations are taken into account. We
were able to devise an implementation that outperformed the general voting technique.
However, as discussed below, our scheme works only because of the semantics of in,
out, and rd; the addition of rdp and inp changes the semantics sufficiently that our
special optimizations can no longer be used.

In  the  remainder  of  this  chapter,  we  discuss  the  relationship  between  our  technique  and 
other  related  research,  additional  Linda  operations,  and  some  extensions  to  our  method  and 
areas  for  further  work. 

6.1  Related  Work 

The  only  other  Linda  kernels  that  approximate  a  distributed  kernel  are  the  S/Net  kernel 
and  the  VAX-LAN  kernel.  We  will  discuss  why  they  are  inappropriate  for  use  in  a  general 
communications  network.  We  will  also  discuss  two  other  replication  approaches  that  can 
be  alternatives  to  our  scheme:  the  voting  scheme  and  the  viewstamped  replication  scheme. 

6.1.1  S/Net  Kernel 

The  S/Net  kernel  is  described  in  detail  in  [8].  The  S/Net  consists  of  several  MC-68000’s 
with  local  memory,  connected  by  a  bus.  The  operations  are  executed  as  follows:  executing 
out(t)  causes  tuple  t  to  be  broadcast  to  every  node  in  the  network;  thus  every  node  stores 
a  complete  copy  of  the  tuple  space.  Executing  in(s)  triggers  a  local  search  for  a  matching  t. 
If  one  is  found,  the  local  kernel  attempts  to  delete  t  network-wide;  if  the  attempt  succeeds, 
t  is  returned  to  the  worker  that  executed  in(s).  If  the  attempt  fails,  the  deleted  tuples 
are  put  back  and  the  operation  is  tried  again.  An  attempt  can  fail  for  two  reasons:  (1) 
some  other  worker  has  simultaneously  attempted  to  delete  t  and  has  succeeded  on  some 
nodes;  and  (2)  some  other  worker  is  executing  a  concurrent  out  operation  and  t  has  not  yet 
reached  all  the  nodes.  If  the  local  search  triggered  by  in(s)  turns  up  no  matching  tuple,  all 
newly-arriving  tuples  are  checked  until  a  match  occurs,  at  which  point  the  matching  tuple 
is  deleted  and  returned  as  before.  Rd  works  in  the  same  way  as  in,  except  that  no  tuple 
deletion  is  attempted;  as  soon  as  a  matching  tuple  is  found,  it  is  returned  immediately  to 
the  reading  worker. 

The  S/Net  kernel  assumes  reliable  broadcast  of  messages,  so  it  does  not  tolerate  failures 
of network or nodes. For instance, if an out’s request does not reach all nodes, the tuple
space  becomes  inconsistent;  an  in  can  never  succeed  if  one  copy  of  the  tuple  space  becomes 
inaccessible.  In  addition,  the  S/Net  requires  that  a  copy  of  tuple  space  be  stored  on  every 
node  while  our  scheme  does  not. 

Our protocol performs¹ as well as the S/Net kernel’s protocol on rd, out, and in operations
(assuming that the S/Net executes out operations in the background). Both schemes
rd  from  one  copy  and  out  to  all  copies.  The  S/Net  kernel  executes  ins  in  one  phase;  our 
protocol  executes  ins  in  two  phases,  but  the  second  phase  is  done  in  the  background  to 
avoid  blocking  the  program  process. 

6.1.2  VAX-LAN  Kernel 

In  a  VAX-LAN  [4],  computing  nodes  are  connected  by  an  Ethernet-based  local  area  network. 
The  VAX-LAN  kernel  uses  the  following  scheme:  out(t)  stores  t  on  one  of  the  nodes;  in(s) 
activates  a  global  search  for  a  match  to  s  on  all  nodes;  rd(s)  also  requires  a  global  search. 
In  this  scheme,  out  is  simple.  In(s)  causes  the  template  s  to  be  broadcast  to  all  nodes. 
Each  node  searches  for  matching  tuples  in  its  local  memory.  If  a  matching  tuple  is  found, 
it  is  deleted  from  the  local  memory  and  shipped  to  the  template-originating  node  using  a 
point-to-point  protocol;  otherwise  the  template  is  stored  locally  for  x  ticks.  All  the  tuples 
arriving  within  these  x  ticks  are  checked,  and  matching  ones  are  sent  off.  The  template  is 
thrown  away  after  x  ticks.  If  the  template’s  originating  node  has  not  received  any  tuple  for 
x  ticks,  then  it  broadcasts  the  template  again.  If  the  originating  node  receives  more  than 
one  matching  tuple,  one  of  them  is  chosen,  and  the  rest  are  stored  on  some  nodes.  Rd(s) 
is  similar. 

Since only one copy of each tuple is stored system-wide, the VAX-LAN scheme does not
provide  high-availability:  if  the  node  owning  the  tuple  t  crashes,  or  a  message  containing 
t  in  response  to  an  in(s)  is  lost,  then  t  becomes  unavailable  or,  worse,  is  lost  forever.  A 
network  partition  may  also  make  some  tuples  unavailable. 

¹ Our analysis of performance is based on the number of messages and delays at the protocol level. We are
not able to make comparisons on any real implementation at this writing.

Our scheme performs as well as the VAX-LAN kernel protocol on rd and out operations.
The  VAX-LAN  performs  better  on  in  operations  since  the  program  process  can  continue  as 
soon  as  the  first  response  arrives.  But  the  better  performance  on  in  operations  comes  from 
the fact that only one copy of each tuple is stored, which is also the reason that the
VAX-LAN cannot be made highly-available.

6.1.3 Voting

Gifford’s  Weighted  Voting  [14]  provides  a  general  replication  method  by  dividing  a  certain 
number of votes, n, among replicas. A read operation has to acquire a read quorum of r
votes and a write operation has to acquire a write quorum of w votes. The requirement
that r + w > n and 2w > n ensures that every read quorum intersects every write quorum
and that write quorums intersect, which in turn implies that there is at least one
up-to-date copy in both read and write quorums. The up-to-date copy is identified by the copy’s
version number. In addition to the version number, each copy also contains its state and
the number of votes assigned to it. Herlihy [16] extended the above voting scheme to take
advantage of operation semantics, and thus made the algorithm more efficient.
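
The quorum requirement can be checked mechanically. The sketch below assumes the simplest assignment of one vote per replica (so n is also the number of replicas); the function names are ours and are not taken from [14]. The second function verifies by brute force that the arithmetic condition coincides with the intersection property it is meant to guarantee.

```python
from itertools import combinations


def valid_quorums(n, r, w):
    """Gifford's condition: reads meet writes, and writes meet writes."""
    return r + w > n and 2 * w > n


def quorums_always_intersect(n, r, w):
    """Brute-force check with one vote per replica.

    Enumerates every read quorum of size r and every write quorum of
    size w over n replicas and tests the two intersection properties.
    """
    replicas = range(n)
    reads = [set(c) for c in combinations(replicas, r)]
    writes = [set(c) for c in combinations(replicas, w)]
    read_write = all(rq & wq for rq in reads for wq in writes)
    write_write = all(a & b for a in writes for b in writes)
    return read_write and write_write
```

With five replicas, the read-one-write-all configuration (r = 1, w = 5) satisfies the condition, while r = 2, w = 3 does not, because a read quorum of two can miss a write quorum of three.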

Our  protocol  is  a  special  case  of  the  voting  scheme  where  the  read  quorum  is  one  and 
the  write  quorum  is  all  the  replicas.  Like  Herlihy’s  scheme,  our  method  utilizes  the  Linda 
operation  semantics  to  achieve  better  performance:  Out  operations  and  the  second  phase  of 
in  operations  are  performed  in  the  background,  which  makes  out’s  appear  to  be  zero-phase, 
and  in’s  to  be  one-phase.  This  outperforms  voting  where  all  write  operations  need  to  be 
two-phase. 

In voting schemes where writes are done to all copies, write operations cannot be performed
if a replica is down or inaccessible. This problem was overcome by the invention of
the  virtual  partition  protocol  [1][2].  Our  view  change  algorithm  is  an  optimization  of  this 
protocol. 
6.1.4 Viewstamped Replication

The viewstamped replication scheme is described in [23][22]. It is an integration of a modified
primary copy scheme [5] and the virtual partition algorithm [1]. This method works as
follows.  The  tuple  space  is  replicated.  Among  the  replicas,  there  is  a  primary  that  executes 
workers’  operation  requests.  The  updates  are  propagated  to  the  rest  of  the  replicas,  called 
backups,  in  background  mode.  Whenever  a  failure  is  detected,  the  replicas  activate  a  view 
change  algorithm  similar  to  ours.  A  new  view  is  formed  if  a  majority  of  replicas  agree  to 
join  the  new  view.  A  new  primary  is  elected  when  the  new  view  is  formed. 

To  identify  the  latest  state  of  the  new  view,  viewstamps  are  used.  A  viewstamp  is 
the  concatenation  of  the  viewid  of  the  view  in  which  the  operation  is  executed  and  the 
timestamp  of  the  operation.  The  viewstamps  help  the  view  manager  to  identify  the  replica 
that  has  the  most  up-to-date  state. 
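
Because a viewstamp is a (viewid, timestamp) pair, comparing viewstamps reduces to lexicographic comparison. The sketch below assumes integer viewids and timestamps and a hypothetical helper most_up_to_date; it illustrates only the ordering, not the full protocol of [22][23].

```python
from typing import NamedTuple


class Viewstamp(NamedTuple):
    viewid: int      # view in which the operation was executed
    timestamp: int   # position of the operation within that view


def most_up_to_date(latest):
    """Return the replica whose latest viewstamp is greatest.

    latest maps a replica name to the Viewstamp of the last operation
    that replica knows about. Named tuples compare field by field, so
    a later view always outranks any timestamp from an earlier view.
    """
    return max(latest, key=latest.get)
```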

The  viewstamped  replication  scheme  is  efficient  (the  workers  only  have  to  talk  to  one 
replica in the normal case), fault-tolerant (it tolerates common failures from general
partitionable networks), and highly-available (the data are replicated). But the current scheme
is defined to work only when workers’ computations run as atomic transactions [19]. How
to adapt the viewstamped replication scheme to Linda is a matter for future research.

6.2  Additional  Linda  Operations 

As  mentioned  in  Chapter  2,  additional  operations  have  been  proposed  for  Linda.  A  rdp 
does  not  wait  for  a  tuple  when  none  matches;  instead  it  signals  an  exception.  Similarly,  an 
inp  does  not  wait  when  there  is  no  match,  but  instead  signals  an  exception. 

These  operations  are  not  compatible  with  our  implementation.  Our  current  scheme 
allows  a  rd  to  observe  the  results  of  a  partially  completed  out  (that  is,  an  out  that  has 
been  completed  at  only  some  of  the  replicas  in  the  current  view).  The  following  example 
illustrates  the  difficulty: 
    worker w                 worker z

    out(“x”, 3)
                             rd(“x”, formal u)     % u = 3
                             rdp(“x”, formal v)    % signals an exception

When  worker  z  reads  3  into  variable  u,  this  implies  that  the  out  has  happened.  Now 
suppose  there  is  a  view  change,  and  the  effects  of  the  out  are  not  part  of  the  initial  state 
of  the  new  view.  Then  the  rdp  of  z  occurs  and  observes  that  the  out  has  not  yet  occurred. 
Note that this problem will not occur if z’s second operation is a rd, since the rd will simply
wait  until  the  effect  of  the  out  can  be  observed. 

Our  implementation  could  support  these  operations  by  having  rd  (and  rdp)  read  all 
replicas  in  the  current  view,  and  only  return  a  tuple  if  it  is  in  the  intersection  of  the  tuples 
returned  by  the  replicas;  if  the  intersection  is  empty  rd  would  try  again  and  rdp  would 
signal.  However,  the  result  of  this  change  is  a  slower  implementation  than  the  one  proposed. 
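
The read-all variant of rdp can be sketched as follows. The representation is an assumption made for the example: each replica is a plain list of tuples, None in a template plays the role of a formal, and duplicate tuples are ignored because the intersection is taken over sets.

```python
class NoMatch(Exception):
    """Raised by rdp_all when no tuple matches at every replica."""


def matches(template, tup):
    # A None field in the template is a formal and matches any value.
    return len(template) == len(tup) and all(
        t is None or t == v for t, v in zip(template, tup)
    )


def rdp_all(replicas, template):
    """Return a tuple that matches the template at every replica in the view."""
    common = None
    for space in replicas:
        found = {t for t in space if matches(template, t)}
        common = found if common is None else common & found
    if not common:
        raise NoMatch(template)
    return min(common, key=repr)  # any member would do; pick deterministically
```

A partially completed out is invisible to this rdp: if (“x”, 3) has reached only one of two replicas, the intersection is empty and the operation signals. A blocking rd built the same way would retry instead of raising. The extra cost is that every read now contacts all replicas in the view rather than one.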

The  S/Net  kernel  also  does  not  support  these  operations.  The  problem  here  comes  up 
in  the  interaction  of  in  with  rdp: 

    worker w                 worker z

    in(“x”, formal v)        rdp(“x”, formal v)    % signals
                             rdp(“x”, formal u)    % returns 3

Here w’s in is running in parallel with z’s rdp’s. Suppose that (“x”, 3) is in the tuple space
before w’s in. The first rdp observes the situation when w is attempting to remove the
tuple, and this tuple has been removed at z’s node. However, suppose the in fails and the
tuple is put back. In this case the second rdp observes the result of putting the tuple back.

The VAX-LAN kernel could support rdp and inp, but, as mentioned earlier, this approach
cannot be made highly-available.
6.3 Extensions of Our Scheme

In describing our system model in Chapter 3, we assumed that replication is uniform: every
tuple is replicated onto all replicas. In this section, we will show that our protocol works
even  when  the  tuple  space  is  not  replicated  uniformly.  We  will  also  describe  a  proposal  for 
tolerating  workers’  failures. 

6.3.1  Nonuniform  Replication 

There are two problems with keeping the entire tuple space at one set of replicas. First, if
the tuple space is large, then each node where a replica resides must provide a large amount
of storage. Second, if accesses to the tuple space are frequent, the replicas’ nodes may
become overloaded and slow down workers more than is acceptable. These problems can
be overcome by partitioning the tuple space among different sets of replicas. The obvious
way to distribute the tuples is by logical name. For example, all tuples with logical name
“x” will be in set S and all those with logical name “y” will be in set T.

Each set of replicas operates completely independently from the other sets. Each set
contains its own replicas. For example, there might be two sets:

    S = {r1, ..., r5}

    T = {r6, ..., r10}

Some of these replicas might reside at the same node, for example, r1 and r7 might both be
at  node  N,  but  more  likely  the  nodes  containing  the  replicas  would  be  disjoint.  The  reason 
for  this  is  that  replica  sets  are  useful,  as  mentioned  above,  for  alleviating  storage  problems 
at  nodes  and  for  reducing  contention.  These  benefits  would  not  be  obtained  if  replicas  in 
different  sets  were  located  at  the  same  node. 

When  a  worker  performs  an  operation,  it  sends  the  request  to  the  replica  set  that 
contains  information  about  that  tuple  or  template.  Obviously,  there  must  be  a  mechanism 
to  determine  what  set  to  use.  This  could  be  done  either  statically  or  dynamically.  An 
example  of  a  static  mechanism  is  a  hash  function  that  maps  logical  names  into  sets.  An 
example  of  a  dynamic  mechanism  is  a  (replicated,  highly-available)  location  server  that 
stores the mapping; workers would maintain a cache containing the mapping for recently
used  tuples  and  consult  the  server  only  when  there  is  a  cache  miss  or  when  the  information 
in  the  cache  is  found  to  be  out  of  date.  Implementations  of  location  servers  are  discussed 
in  [12][15][17][21]. 
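
A static mapping of the kind mentioned above might look like the following sketch. The set names S and T and the replica names follow the example earlier in this section; the choice of SHA-256 and the function name replica_set_for are assumptions. A cryptographic digest is used only because it is stable across processes, which Python's built-in hash() of strings is not.

```python
import hashlib

# Replica sets as in the example above: S = {r1..r5}, T = {r6..r10}.
REPLICA_SETS = {
    "S": ["r1", "r2", "r3", "r4", "r5"],
    "T": ["r6", "r7", "r8", "r9", "r10"],
}


def replica_set_for(logical_name, set_names=("S", "T")):
    """Map a tuple's logical name to the name of the replica set that stores it."""
    digest = hashlib.sha256(logical_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(set_names)
    return set_names[index]
```

Every worker computes the same set for the same logical name without consulting a server, so a request for a given tuple or template can be sent directly to REPLICA_SETS[replica_set_for(name)]. Which names land in S and which in T depends on the hash, so the particular assignment of “x” to S and “y” to T used in the text is not guaranteed by this function.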

The  portion  of  a  worker’s  code  that  interacts  with  replicas  would  need  to  take  the 
multiple  sets  into  account.  Operations  concerning  the  same  set  would  be  done  in  order 
just  as  described  in  Chapter  3.  Operations  that  make  use  of  different  sets  can  be  done 
in  the  background  in  parallel,  except  that  we  still  need  the  same  synchronization  we  have 
now,  namely  that  prior  in2’s  must  complete  before  an  out  can  start.  For  example,  suppose 
logical  tuple  “x”  is  stored  at  set  S  and  logical  tuple  “y”  is  stored  at  set  T.  Consider  first 

in(“x”,  . . .) 

in(“y”,  . . .) 

rd(“x”,  . . .) 

rd(“y”,  . . .) 

The  rd’s  of  “x”  and  “y”  will  not  observe  the  old  tuples  removed  by  the  respective  in’s 
because  operations  are  done  in  order  at  each  set.  Thus  at  S  we  do  in(“x”)  before  the 
rd(“x”),  and  at  T  we  do  in(“y”)  before  the  rd(“y”).  Now  consider 

in(“x”, . . .)

in(“y”, . . .)

out(“x”,  . . .) 

The  start  of  the  out  will  be  delayed  until  both  in(“x”)  and  in(“y”)  are  completed.  This 
will  ensure  that  some  other  worker  that  observes  the  effect  of  the  out  will  not  subsequently 
be  able  to  observe  the  tuples  removed  by  either  in(“x”)  or  in(“y”). 

View changes occur independently at each set, using the protocol described in Chapter 5.


6.3.2  Workers’  Failures 

This  thesis  proposed  a  scheme  to  build  a  fault-tolerant  kernel  that  makes  the  Linda  tuple 
space  highly-available.  But  even  with  such  a  kernel,  Linda  programs  are  not  completely 
fault-tolerant since the failure of a worker can cause problems. In particular, if a worker
crashes after starting an in1, but before completing the corresponding in2, some tuples
may be locked forever. This section proposes a way to tolerate workers’ failures.

What  we  would  like  is  to  release  locks  held  by  crashed  workers.  However,  as  mentioned 
earlier,  it  is  not  possible  in  general  to  distinguish  a  node  crash  from  a  partition.  Thus, 
the absence of messages from a worker may mean either that it has crashed or that it cannot
communicate  because  of  a  partition.  Releasing  the  worker’s  locks  in  the  case  of  a  partition 
would  be  a  problem,  because  the  worker  is  still  running  and  therefore  depends  on  its  locks. 

We  can  solve  this  problem  by  forcing  a  worker  that  cannot  communicate  because  of 
a  partition  to  crash.  The  idea  is  for  replicas  to  maintain  two  views  (and  viewids):  the 
replica-view  as  discussed  earlier  in  the  thesis,  and  also  a  worker-view.  Initially  all  workers 
are  in  the  worker-view.  Replicas  send  probe  messages  to  workers  and  workers  respond  to 
these  messages.  If  a  worker  does  not  respond  to  probes  after  a  sufficient  number  of  tries,  the 
replicas carry out a worker view change, during which all replicas in the current replica-view
agree on a new worker-view and worker-viewid. As part of the view change, an initial state
is  selected  for  the  new  view  as  usual,  except  that  all  locks  held  by  the  excluded  worker  are 
released.  As  is  the  case  in  any  view  change,  a  majority  of  replicas  must  participate  in  the 
view  change. 

Whenever a replica receives an operation request from a worker, it checks to be sure the
worker is in the current worker-view. If not, the request is rejected, and the worker is sent
a “you must crash” message. When a worker receives such a message it stops processing
immediately. 
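
The replica-side bookkeeping for worker-views can be sketched as below. This shows the state kept at a single replica only; the agreement among a majority of replicas that a real worker view change requires is omitted, and the class, method, and field names are assumptions made for the example.

```python
class Replica:
    """One replica's worker-view state (agreement among replicas omitted)."""

    def __init__(self, workers):
        self.worker_viewid = 0
        self.worker_view = set(workers)  # initially all workers are in the view
        self.locks = {}                  # locked tuple -> worker holding the lock

    def worker_view_change(self, unresponsive):
        # Exclude workers that stopped answering probes, then release
        # every lock they held as part of the new view's initial state.
        self.worker_view -= set(unresponsive)
        self.worker_viewid += 1
        self.locks = {
            tup: worker
            for tup, worker in self.locks.items()
            if worker in self.worker_view
        }

    def handle(self, worker, request):
        # Requests from workers outside the current worker-view are rejected.
        if worker not in self.worker_view:
            return "you must crash"
        return "accepted"
```

A worker that was merely partitioned away and later reconnects gets the “you must crash” reply on its next request, so it can never act on locks that have already been released.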

Given these semantics, fault-tolerant programs can be written in Linda. Figure 6.1 shows
the form of such a program. The idea here is that the workers collaborate to carry out
task-numbers of tasks; information about these tasks is contained in the tasks array in the
tuple space. To keep track of what workers are doing, we use the status array in tuple space.
status[i] = 0 means that task i has not yet been worked on; status[i] < 0 means that task i
has been completed; status[i] > 0 means that task i is being worked on. In this latter case
the value of status[i] tells how many times workers have attempted to perform task i.


cnt: array[record[round, time: int]]   % a local array at each worker
                                       % initially cnt[i] = <0, 0> for all i
while true do
    done: bool := true
    for i in task-numbers do
        in(“status”, i, formal v)
        if v < 0 then
            out(“status”, i, v)
            continue   % to the next iteration of the for loop
        elseif v = 0 or (cnt[i].round = v and cnt[i].time < current_time) then
            out(“status”, i, v+1)
            % do tasks[i] here ...
            in(“status”, i, formal v)
            out(“status”, i, -1)
            continue
        elseif cnt[i].round ~= v then cnt[i] := <v, current_time + δ>
        end % if
        out(“status”, i, v)   % task i is in progress elsewhere: put its status tuple back
        done := false
    end % for loop
    if done then
        return   % only get here when status[i] < 0 for all i
    end % if
end % while


Figure  6.1:  A  Fault  Tolerant  Worker 
A worker cycles through the status array looking for a task to be done. Such a task
is either one that has never been attempted, or one that has been attempted by another
worker in the past but not completed within a reasonable delay δ. When it first discovers a
task being worked on by another worker, it records this fact in its local cnt array, together
with  an  estimation  of  when  that  worker  should  complete.  If  it  later  discovers  that  the  task 
is  still  being  worked  on  by  that  worker,  but  the  time  estimate  has  been  exceeded,  it  takes 
on  the  task  itself.  In  this  case  it  stores  a  larger  round  number  in  the  status  array  to  prevent 
other  workers  from  also  redoing  the  task  at  this  point.  (The  estimated  time  of  completion 
could  be  stored  in  the  status  array  provided  we  assume  that  the  clocks  of  the  workers  are 
loosely  synchronized.  If  workers  do  not  have  clocks,  they  can  keep  track  of  how  many  times 
they  have  noticed  that  a  particular  round  for  task  i  is  occurring,  and  take  on  the  task 
themselves  when  this  number  reaches  some  maximum.) 

Note  that  there  is  an  assumption  here:  it  is  all  right  to  do  a  computation  more  than  once. 
This is sometimes undesirable. If the computation is not repeatable, additional techniques
are  needed  that  allow  computations  to  run  as  atomic  actions  [11].  Adding  atomic  actions 
to  Linda  requires  further  research. 